DATA602 Principles of Data Science


Fall Semester, 2024


Prof. Fardina Alam


Retail Recommendation Engine


Group Members

Anto Delin Xavier

Athreya Sudarshan Srikanth

Pranav Shashidhara

Swati Nair

Vivek Vasisht Ediga


Date of Submission: December 15th, 2024

Contributions

  1. Anto Delin Xavier
    Contributed to project ideation and feature engineering ('Category'). Conducted Chi-Squared tests and developed an Upsell model using Apriori. Worked on the GitHub tutorial.

  2. Athreya Sudarshan Srikanth
    Involved in project ideation and feature engineering ('Category'). Performed correlation analysis and built an Upsell model using Eclat. Contributed to the GitHub tutorial.

  3. Pranav Shashidhara
    Contributed to project ideation and feature engineering ('Category'). Conducted Chi-Squared tests, developed a Cross Sell model with Apriori, and created Tableau dashboards.

  4. Swati Nair
    Participated in project ideation and feature engineering ('Category'). Conducted ANOVA tests and developed a Cross Sell model using Eclat. Worked on the final report.

  5. Vivek Vasisht Ediga
    Contributed to project ideation, dataset preprocessing, and feature engineering ('Category'). Performed ANOVA tests and developed Upsell and Cross Sell models using FP Growth. Worked on the final report.

INTRODUCTION

Our Retail Recommendation Engine project aims to increase retailers' sales through better suggestions about what to offer their customers. We applied two key strategies: cross-selling, which suggests complementary items (such as accessories for electronics), and upselling, which recommends upgraded or premium alternatives, often at a higher price point. To generate these recommendations, we used three association rule mining algorithms: Apriori, FP-Growth, and ECLAT, each tested on real transaction data. Each algorithm was run twice, once for each recommendation type, and evaluated on metrics such as support, confidence, lift, and processing efficiency. This matters because smarter recommendations translate into happier customers, larger basket sizes, and higher retailer profits. The comparative approach helps businesses balance accuracy against speed when scaling to large datasets. The goal is to bridge the gap between data science theory and practical retail strategy, delivering personalized shopping experiences that benefit both customers and businesses in the long term.
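The three evaluation metrics mentioned above can be illustrated with a minimal sketch; the toy baskets below are invented for illustration and are not drawn from the project data:

```python
# Toy baskets, invented purely for illustration.
baskets = [
    {"LANTERN", "HAND WARMER"},
    {"LANTERN", "HAND WARMER", "LUNCH BAG"},
    {"LUNCH BAG"},
    {"LANTERN"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence scaled by the consequent's baseline support; >1 suggests a genuine association."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

a, c = {"LANTERN"}, {"HAND WARMER"}
print(support(a | c, baskets))    # 0.5
print(confidence(a, c, baskets))  # ~0.667
print(lift(a, c, baskets))        # ~1.333
```

Apriori, FP-Growth, and ECLAT all mine the same frequent itemsets; they differ in how they traverse the search space, which is why runtime efficiency is compared alongside rule quality.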

DATA CURATION

The project dataset contains 281,104 customer transaction records with the following key fields: a unique InvoiceNo; product identifiers (StockCode and Description); Quantity purchased; UnitPrice; InvoiceDate; CustomerID; and Country for regional analysis. The dataset is hosted in a PostgreSQL database on AWS RDS, connected to Tableau for seamless real-time updates and interactive visualizations. To focus on meaningful patterns, transactions were filtered to the top-performing markets: the UK, France, Germany, and EIRE. Data cleaning included removing duplicates, handling missing values, correcting invalid entries, and treating outliers for reliability and consistency. Feature engineering added a Category column that groups products for upselling relevance, enabling recommendations of logical upgrades (for instance, higher-end models in the same category). The result is a robust, actionable dataset prepared for efficient cross-selling and upselling recommendations, with customer-focused insights and good computational performance.
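One way the Category column could be engineered is a keyword-to-category lookup over Description; the mapping below is a hypothetical sketch, not the project's actual mapping:

```python
import pandas as pd

# Hypothetical keyword-to-category mapping (illustrative only, not the project's real mapping).
CATEGORY_KEYWORDS = {
    "BAG": "Bags",
    "LANTERN": "Lighting",
    "CUTLERY": "Kitchen",
}

def assign_category(description) -> str:
    """Map a product description to a coarse category by keyword lookup."""
    desc = str(description).upper()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in desc:
            return category
    return "Other"

sample = pd.DataFrame({"Description": [
    "JUMBO BAG RED RETROSPOT",
    "WHITE METAL LANTERN",
    "SET 7 BABUSHKA NESTING BOXES",
]})
sample["Category"] = sample["Description"].apply(assign_category)
print(sample)
```

Grouping by category like this is what lets an upsell model restrict candidate recommendations to higher-priced items within the same category.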


DATA CLEANING AND DATA PREPROCESSING

Importing the necessary libraries and dataset

DB connection has been set up in AWS RDS:

Port: 5432

DB link: projects-database.cxgcu68ksihx.us-east-1.rds.amazonaws.com

Database type: PostgreSQL.

The DB is linked to Tableau to display live changes in the RDS data. We will connect the RDS DB to this notebook.

In [ ]:
!pip install psycopg2-binary
Collecting psycopg2-binary
  Downloading psycopg2_binary-2.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Downloading psycopg2_binary-2.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 14.2 MB/s eta 0:00:00
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.9.10
In [ ]:
import psycopg2
import pandas as pd

# Database connection details
DB_HOST = "projects-database.cxgcu68ksihx.us-east-1.rds.amazonaws.com"
DB_PORT = 5432
DB_NAME = "postgres"
DB_USERNAME = "postgres"
DB_PASSWORD = "DATA602_project"

# Connect to the PostgreSQL database
try:
    connection = psycopg2.connect(
        host=DB_HOST,
        port=DB_PORT,
        database=DB_NAME,
        user=DB_USERNAME,
        password=DB_PASSWORD
    )
    print("Database connection successful.")
except Exception as e:
    print(f"Error connecting to the database: {e}")
Database connection successful.
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('../dataset/DATA602ProjectCleanedNew.csv')
print(df.head())
   Unnamed: 0  index  InvoiceNo StockCode  \
0           1      1     536365     71053   
1           2      2     536365    84406B   
2           3      3     536365    84029G   
3           5      5     536365     22752   
4           7      7     536366     22633   

                           Description  Quantity       InvoiceDate  UnitPrice  \
0                  WHITE METAL LANTERN         6  12-01-2010 08:26       3.39   
1       CREAM CUPID HEARTS COAT HANGER         8  12-01-2010 08:26       2.75   
2  KNITTED UNION FLAG HOT WATER BOTTLE         6  12-01-2010 08:26       3.39   
3         SET 7 BABUSHKA NESTING BOXES         2  12-01-2010 08:26       7.65   
4               HAND WARMER UNION JACK         6  12-01-2010 08:28       1.85   

   CustomerID         Country  
0     17850.0  United Kingdom  
1     17850.0  United Kingdom  
2     17850.0  United Kingdom  
3     17850.0  United Kingdom  
4     17850.0  United Kingdom  

Established a connection to the PostgreSQL database using the psycopg2 library, with error handling to confirm whether the connection succeeded. Then read the cleaned dataset DATA602ProjectCleanedNew.csv into a pandas DataFrame and displayed the first few rows, setting up the data for further analysis.

Parse the data

In [ ]:
df['InvoiceNo'] = df['InvoiceNo'].astype(str).fillna('')  # Fill NaN with empty strings

# Extract the first character
first_chars = df['InvoiceNo'].str[0]

# Keep only the entries that are alphabetic
unique_letters = first_chars[first_chars.str.isalpha()].unique()
print(unique_letters)
[]
In [ ]:
df['InvoiceNo'] = df['InvoiceNo'].str.replace('^C', '', regex=True)
df['InvoiceNo'] = df['InvoiceNo'].str.replace('^A', '', regex=True)

# Convert 'Quantity' and 'UnitPrice' to numeric, coercing errors to NaN
df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce')
df['UnitPrice'] = pd.to_numeric(df['UnitPrice'], errors='coerce')

# Convert 'InvoiceNo' to integer safely
df['InvoiceNo'] = pd.to_numeric(df['InvoiceNo'], errors='coerce')  # Non-numeric values become NaN
df = df.dropna(subset=['InvoiceNo'])           # Drop rows whose InvoiceNo could not be parsed
df['InvoiceNo'] = df['InvoiceNo'].astype(int)  # Safe to convert now that NaNs are gone

Converted the InvoiceNo column to strings, filled missing values with empty strings, and extracted the first character of each entry to identify unique alphabetic characters. Converted the Quantity and UnitPrice columns to numeric types, handling errors by coercing invalid values to NaN. Safely converted the InvoiceNo column to integers by handling NaN values and dropping them before the conversion.

Cleaning and Organizing the data

In [ ]:
# Handling missing values
df.fillna({'CustomerID': 'Unknown', 'Description': 'Unknown'}, inplace=True)
df = df[(df['Quantity'] >= 0) & (df['UnitPrice'] >= 0.1)]
df
Out[ ]:
Unnamed: 0 index InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 1 1 536365 71053 WHITE METAL LANTERN 6 12-01-2010 08:26 3.39 17850.0 United Kingdom
1 2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12-01-2010 08:26 2.75 17850.0 United Kingdom
2 3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12-01-2010 08:26 3.39 17850.0 United Kingdom
3 5 5 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 12-01-2010 08:26 7.65 17850.0 United Kingdom
4 7 7 536366 22633 HAND WARMER UNION JACK 6 12-01-2010 08:28 1.85 17850.0 United Kingdom
... ... ... ... ... ... ... ... ... ... ...
281099 521720 521720 581587 23256 CHILDRENS CUTLERY SPACEBOY 4 12-09-2011 12:50 4.15 12680.0 France
281100 521721 521721 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12 12-09-2011 12:50 0.85 12680.0 France
281101 521723 521723 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 12-09-2011 12:50 4.15 12680.0 France
281102 521724 521724 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 12-09-2011 12:50 4.15 12680.0 France
281103 521725 521725 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 12-09-2011 12:50 4.95 12680.0 France

281104 rows × 10 columns

Handled missing values by filling CustomerID and Description with Unknown. Filtered the dataset to include only rows where Quantity was greater than or equal to 0 and UnitPrice was greater than or equal to 0.1.

In [ ]:
# Removing special characters and lowercase letters from Description
# (Description is expected to contain product names in uppercase only).
# A single filter suffices: [^A-Z0-9\s] matches special characters and lowercase letters alike.
df = df[~df['Description'].str.contains(r'[^A-Z0-9\s]', regex=True)]
# Display updated dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281104 entries, 0 to 281103
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   281104 non-null  int64  
 1   index        281104 non-null  int64  
 2   InvoiceNo    281104 non-null  int32  
 3   StockCode    281104 non-null  object 
 4   Description  281104 non-null  object 
 5   Quantity     281104 non-null  int64  
 6   InvoiceDate  281104 non-null  object 
 7   UnitPrice    281104 non-null  float64
 8   CustomerID   281104 non-null  float64
 9   Country      281104 non-null  object 
dtypes: float64(2), int32(1), int64(3), object(4)
memory usage: 22.5+ MB

Removed special characters and values containing lowercase letters from the Description column, ensuring it contained only product names in uppercase letters and numbers. Displayed the updated dataset structure using df.info().
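The effect of the uppercase-only filter can be seen on a few made-up sample rows:

```python
import pandas as pd

# Made-up sample rows showing which Descriptions survive the filter.
sample = pd.DataFrame({"Description": [
    "WHITE METAL LANTERN",           # kept: uppercase letters and spaces only
    "damaged?",                      # dropped: lowercase letters and punctuation
    "SET 7 BABUSHKA NESTING BOXES",  # kept: uppercase, digits, spaces
    "Manual",                        # dropped: contains lowercase letters
]})
kept = sample[~sample["Description"].str.contains(r"[^A-Z0-9\s]", regex=True)]
print(kept["Description"].tolist())
# ['WHITE METAL LANTERN', 'SET 7 BABUSHKA NESTING BOXES']
```

Lowercase entries in this dataset typically mark administrative notes rather than products, which is why they are filtered out.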

In [ ]:
# Define a function to detect outliers using IQR
def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# Remove outliers from Quantity and UnitPrice columns
df_cleaned = remove_outliers(df, 'Quantity')
df_cleaned = remove_outliers(df_cleaned, 'UnitPrice')

# Display the cleaned dataset after outlier removal
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 265228 entries, 0 to 281103
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   265228 non-null  int64  
 1   index        265228 non-null  int64  
 2   InvoiceNo    265228 non-null  int32  
 3   StockCode    265228 non-null  object 
 4   Description  265228 non-null  object 
 5   Quantity     265228 non-null  int64  
 6   InvoiceDate  265228 non-null  object 
 7   UnitPrice    265228 non-null  float64
 8   CustomerID   265228 non-null  float64
 9   Country      265228 non-null  object 
dtypes: float64(2), int32(1), int64(3), object(4)
memory usage: 21.2+ MB

Defined a function to detect and remove outliers using the IQR method. Applied the function to the Quantity and UnitPrice columns to filter out values outside the calculated bounds. Displayed the structure of the cleaned dataset using df_cleaned.info().
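The IQR rule used above can be checked on a tiny illustrative series:

```python
import pandas as pd

# Illustrative values: 100 is an obvious outlier.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = s[(s >= lower) & (s <= upper)]
print(kept.tolist())  # [1, 2, 2, 3, 3, 3, 4, 4, 5]
```

Only the extreme value falls outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR], mirroring how the bulk wholesale orders were trimmed from Quantity and UnitPrice.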

In [ ]:
df_cleaned
Out[ ]:
Unnamed: 0 index InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 1 1 536365 71053 WHITE METAL LANTERN 6 12-01-2010 08:26 3.39 17850.0 United Kingdom
1 2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12-01-2010 08:26 2.75 17850.0 United Kingdom
2 3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12-01-2010 08:26 3.39 17850.0 United Kingdom
4 7 7 536366 22633 HAND WARMER UNION JACK 6 12-01-2010 08:28 1.85 17850.0 United Kingdom
5 8 8 536366 22632 HAND WARMER RED POLKA DOT 6 12-01-2010 08:28 1.85 17850.0 United Kingdom
... ... ... ... ... ... ... ... ... ... ...
281099 521720 521720 581587 23256 CHILDRENS CUTLERY SPACEBOY 4 12-09-2011 12:50 4.15 12680.0 France
281100 521721 521721 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12 12-09-2011 12:50 0.85 12680.0 France
281101 521723 521723 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 12-09-2011 12:50 4.15 12680.0 France
281102 521724 521724 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 12-09-2011 12:50 4.15 12680.0 France
281103 521725 521725 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 12-09-2011 12:50 4.95 12680.0 France

265228 rows × 10 columns

In [ ]:
df_cleaned.to_csv("DATA602ProjectCleanedNew.csv")

Basic Data Exploration and Summary Statistics

In [ ]:
print(df.info())
print(df.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281104 entries, 0 to 281103
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   281104 non-null  int64  
 1   index        281104 non-null  int64  
 2   InvoiceNo    281104 non-null  int32  
 3   StockCode    281104 non-null  object 
 4   Description  281104 non-null  object 
 5   Quantity     281104 non-null  int64  
 6   InvoiceDate  281104 non-null  object 
 7   UnitPrice    281104 non-null  float64
 8   CustomerID   281104 non-null  float64
 9   Country      281104 non-null  object 
dtypes: float64(2), int32(1), int64(3), object(4)
memory usage: 22.5+ MB
None
Unnamed: 0     0
index          0
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

Displayed the dataset structure using df.info(), which provides details about the number of entries, columns, and data types. Counted and printed the total number of missing values in each column using df.isnull().sum().

Summary statistics

In [ ]:
# Removing duplicates (if any)
df.drop_duplicates(inplace=True)
In [ ]:
print(df.describe())
print(df.describe(include=['O']))
          Unnamed: 0          index      InvoiceNo       Quantity  \
count  281104.000000  281104.000000  281104.000000  281104.000000   
mean   270417.465774  270417.465774  560837.831891       5.608792   
std    147100.082629  147100.082629   13138.266015       4.608165   
min         1.000000       1.000000  536365.000000       1.000000   
25%    144817.750000  144817.750000  549387.000000       2.000000   
50%    277621.500000  277621.500000  562166.000000       4.000000   
75%    396791.250000  396791.250000  572309.000000      10.000000   
max    521725.000000  521725.000000  581587.000000      23.000000   

           UnitPrice     CustomerID  
count  281104.000000  281104.000000  
mean        2.664358   15438.043834  
std         2.038989    1655.061674  
min         0.100000   12413.000000  
25%         1.250000   14096.000000  
50%         1.950000   15379.000000  
75%         3.750000   16904.000000  
max         8.500000   18287.000000  
       StockCode              Description       InvoiceDate         Country
count     281104                   281104            281104          281104
unique      2961                     3097             14855               4
top       85099B  JUMBO BAG RED RETROSPOT  11/14/2011 15:27  United Kingdom
freq        1283                     1283               423          264436

Removed duplicate rows from the dataset using df.drop_duplicates(inplace=True). Displayed summary statistics for numeric columns using df.describe() and for object-type columns using df.describe(include=['O']).

Histogram for numerical columns and Box Plot for numerical columns

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram for numerical columns
df.hist(bins=20, figsize=(14, 10), grid=False)
plt.show()
# Box plot for numerical columns
df.plot(kind='box', subplots=True, layout=(5, 5), figsize=(14, 10), grid=False)
plt.show()

Plotted a series of histograms to display the distributions of index, InvoiceNo, Quantity, UnitPrice, and CustomerID columns. The histograms provided insights into the spread and frequency of values across these columns. Additionally, box plots highlighted the ranges and outliers for helping to visualize the central tendency and variability in the data.

Bar graph to display countries with the highest transaction volumes.

In [ ]:
plt.figure(figsize=(10, 6))
top_countries = df['Country'].value_counts().head(10)
top_countries.plot(kind='bar', color='skyblue')  # Specify color for consistency
plt.title('Top Countries by Transactions')
plt.xlabel('Country')
plt.ylabel('Number of Transactions')
plt.xticks(rotation=45)
plt.show()

Created a bar chart to show the top countries by transactions, where the United Kingdom had the highest number of transactions by a significant margin compared to other countries like Germany, France, and EIRE. The chart highlights the dominance of the UK in the dataset's transaction distribution.

Getting descriptions for these products and obtaining total count

In [ ]:
top_products = df['StockCode'].value_counts().head(10).reset_index()
top_products.columns = ['StockCode', 'Count']

# Get the descriptions for these products
top_product_details = df[df['StockCode'].isin(top_products['StockCode'])][['StockCode', 'Description']].drop_duplicates()

# Merge to get the counts along with descriptions
top_product_details = pd.merge(top_product_details, top_products, on='StockCode')

# Print the results
print(top_product_details)
   StockCode                        Description  Count
0      20725            LUNCH BAG RED RETROSPOT   1149
1      84879      ASSORTED COLOUR BIRD ORNAMENT   1001
2      22382         LUNCH BAG SPACEBOY DESIGN     937
3      22383            LUNCH BAG SUKI  DESIGN     997
4      22383             LUNCH BAG SUKI DESIGN     997
5     85099B            JUMBO BAG RED RETROSPOT   1283
6      47566                      PARTY BUNTING   1221
7      22720  SET OF 3 CAKE TINS PANTRY DESIGN    1029
8      23298                     SPOTTY BUNTING    934
9      23209          LUNCH BAG DOILEY PATTERN     926
10     23209           LUNCH BAG VINTAGE DOILY     926
11     23209          LUNCH BAG VINTAGE DOILEY     926
12     23203          JUMBO BAG DOILEY PATTERNS    963
13     23203          JUMBO BAG VINTAGE DOILEY     963
14     23203           JUMBO BAG VINTAGE DOILY     963

Identified the top 10 products by StockCode frequency, retrieved their descriptions, and merged the details with their counts for a summary of popular items.
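The output above shows some StockCodes mapping to several Description variants (e.g., 23203). One way to collapse these before merging is to keep a single canonical description per code; the rows below are made up to mimic that pattern:

```python
import pandas as pd

# Made-up rows mimicking the duplicate-description pattern in the output above.
products = pd.DataFrame({
    "StockCode":   ["23203", "23203", "23203", "85099B"],
    "Description": ["JUMBO BAG DOILEY PATTERNS", "JUMBO BAG VINTAGE DOILEY",
                    "JUMBO BAG VINTAGE DOILY", "JUMBO BAG RED RETROSPOT"],
})
# Keep one canonical description per StockCode (the most frequent; ties broken alphabetically by mode()).
canonical = (products.groupby("StockCode")["Description"]
             .agg(lambda s: s.mode().iloc[0])
             .reset_index())
print(canonical)
```

Deduplicating this way keeps each product from being double-counted when description variants differ only in spelling.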

Top 10 customers by overall spending

In [ ]:
df['Total_Sales'] = df['Quantity'] * df['UnitPrice']

if 'CustomerID' in df.columns and 'Total_Sales' in df.columns:
    # Calculate total sales per customer
    customer_spending = df.groupby('CustomerID')['Total_Sales'].sum().reset_index()
    top_10_customers = customer_spending.sort_values(by='Total_Sales', ascending=False).head(10)

    # Calculate total spending and percentage contribution
    total_spending = top_10_customers['Total_Sales'].sum()
    top_10_customers['Percentage'] = (top_10_customers['Total_Sales'] / total_spending) * 100

    # Create the bar plot
    plt.figure(figsize=(12, 6))
    bars = plt.bar(top_10_customers['CustomerID'].astype(str), top_10_customers['Total_Sales'] / 1000, color='skyblue', width=0.6)
    plt.title('Top 10 Customers by Total Spending', fontsize=16)
    plt.xlabel('Customer ID', fontsize=14)
    plt.ylabel('Total Spending (£k)', fontsize=14)
    plt.xticks(rotation=45, fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # Annotate bars with total spending and percentage
    for bar, percentage in zip(bars, top_10_customers['Percentage']):
        total_value = bar.get_height() * 1000  # Original total price for correct labeling
        rounded_value = round(total_value)  # Round to nearest integer
        plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), f'{rounded_value:,} / {percentage:.2f}%',
                 ha='center', va='bottom', color='black', fontsize=10)

    plt.tight_layout()
    plt.show()
else:
    print("One or more required columns are missing from the DataFrame.")

Displayed a bar chart of the top 10 customers by total spending, showing the amounts in thousands of pounds (£k) and their percentage contributions. The chart highlighted one customer (14911) with significantly higher spending, accounting for over 28% of the total, followed by others with more moderate contributions.

Customers with the highest sales percentage for each month

In [ ]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], errors='coerce')

# Extract Year and Month
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month

# Calculate monthly sales by CustomerID
monthly_sales = df.groupby(['Year', 'Month', 'CustomerID'])['Total_Sales'].sum().reset_index()

# Calculate total monthly sales
total_monthly_sales = monthly_sales.groupby(['Year', 'Month'])['Total_Sales'].transform('sum')

# Calculate sales percentage
monthly_sales['SalesPercentage'] = (monthly_sales['Total_Sales'] / total_monthly_sales) * 100

# Get the customer with the highest sales percentage for each month
highest_sales_customers = monthly_sales.loc[monthly_sales.groupby(['Year', 'Month'])['SalesPercentage'].idxmax()]

# Plotting
plt.figure(figsize=(12, 6))
bars = plt.bar(
    highest_sales_customers['CustomerID'].astype(str) + ' (' + highest_sales_customers['Month'].astype(str) + '/' + highest_sales_customers['Year'].astype(str) + ')',
    highest_sales_customers['SalesPercentage'],
    color='skyblue'
)

plt.title('Highest Sales Customer Percentage for Each Month')
plt.xlabel('Customer ID (Month/Year)')
plt.ylabel('Sales Percentage (%)')

# Annotate the bars with the percentage values
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval, f'{yval:.2f}%', ha='center', va='bottom', color='black')

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Displayed a bar chart showing the highest sales percentage by customer for each month, with the percentage contributions labeled above each bar. The chart highlighted variations in the top-performing customers over time, the highest share being customer 14096 at 3.05% in December 2011.

Top Selling Products by Quantity

In [ ]:
top_products = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)

top_products_percentage = (top_products / top_products.sum()) * 100

plt.figure(figsize=(10,6))
bars = top_products_percentage.plot(kind='barh', color='skyblue')
plt.title('Top 10 Selling Products by Quantity (Percentage)')
plt.xlabel('Percentage of Total Quantity Sold')
plt.ylabel('Product Description')
plt.gca().invert_yaxis()

for index, value in enumerate(top_products_percentage):
    plt.text(value, index, f'{value:.2f}%', va='center', ha='left', color='black')

plt.show()

Displayed a horizontal bar chart showing the top 10 selling products by quantity, with their percentage contributions to the total quantity sold. The Jumbo Bag Red Retrospot emerged as the best-selling product, accounting for 15.38% of the total, followed by Assorted Colour Bird Ornament and Lunch Bag Red Retrospot with 13.43% and 12.01%, respectively.

Data Visualization for Sales over Time

In [ ]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month

monthly_sales = df.groupby(['Year', 'Month'])['Total_Sales'].sum()
total_sales_all_time = df['Total_Sales'].sum()
monthly_sales_percentage = (monthly_sales / total_sales_all_time) * 100

plt.figure(figsize=(12, 6))
monthly_sales.plot(kind='line', marker='o', color='b', label='Total Sales')
plt.title('Sales Over Time')
plt.ylabel('Total Sales')
plt.xlabel('Month')
plt.xticks(ticks=range(len(monthly_sales)), labels=[f'{month}/{year}' for year, month in monthly_sales.index], rotation=45)

for i, (total, percentage) in enumerate(zip(monthly_sales, monthly_sales_percentage)):
    plt.text(i, total, f'{percentage:.2f}%', ha='center', va='bottom', color='black')

plt.legend()
plt.tight_layout()
plt.show()

monthly_sales_summary = pd.DataFrame({
    'TotalSales': monthly_sales,
    'SalesPercentage': monthly_sales_percentage
})

print(monthly_sales_summary)
            TotalSales  SalesPercentage
Year Month                             
2010 12      217980.71         6.397026
2011 1       178437.71         5.236567
     2       178720.67         5.244871
     3       229551.46         6.736590
     4       193893.41         5.690142
     5       276299.32         8.108488
     6       235987.81         6.925476
     7       230165.79         6.754618
     8       243730.26         7.152691
     9       373461.71        10.959888
     10      407237.53        11.951099
     11      503102.18        14.764415
     12      138963.60         4.078130

Displayed a line chart illustrating total sales over time by month. The chart showed fluctuations in sales throughout the year, with a significant peak in November 2011 (14.76%), followed by a steep decline in December 2011 (4.08%). This trend highlights seasonal variations in sales performance.

Countries of top 10 spenders

In [ ]:
# Group by country and sum total sales for each country
country_total_sales = df.groupby('Country')['Total_Sales'].sum()
sorted_country_sales = country_total_sales.sort_values(ascending=False)
top_10_countries = sorted_country_sales.head(10)


df_without_unknown = df[df['CustomerID'] != 'Unknown']
customer_spending = df_without_unknown.groupby('CustomerID')['Total_Sales'].sum()
sorted_customer_spending = customer_spending.sort_values(ascending=False)

top_10_customers = sorted_customer_spending.head(10)
top_10_customer_countries = df_without_unknown[df_without_unknown['CustomerID'].isin(top_10_customers.index)]['Country']
top_customer_country_counts = top_10_customer_countries.value_counts()

plt.figure(figsize=(10, 6))
top_customer_country_counts.plot(kind='bar', color='skyblue')
plt.title('Countries of Top 10 Spenders (Excluding Unknown)')
plt.xlabel('Country')
plt.ylabel('Number of Top Spenders')
plt.xticks(rotation=90)
plt.show()
df.columns
Out[ ]:
Index(['Unnamed: 0', 'index', 'InvoiceNo', 'StockCode', 'Description',
       'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country',
       'Total_Sales', 'Year', 'Month'],
      dtype='object')

Displayed a bar chart comparing the countries of the top 10 spenders. The United Kingdom accounted for the vast majority of these customers' transactions (just below 20,000), significantly outpacing EIRE.

Correlation Matrix

In [ ]:
#Correlation matrix between Quantity, UnitPrice and Total Sales
correlation_matrix = df[['Quantity', 'UnitPrice', 'Total_Sales']].corr()

plt.figure(figsize=(8,6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Displayed a correlation matrix to analyze the relationships between Quantity, UnitPrice, and Total_Sales. The chart showed a moderate positive correlation between Quantity and Total_Sales (0.55) and between UnitPrice and Total_Sales (0.4), while Quantity and UnitPrice exhibited a weak negative correlation (-0.3).

Boxplot for Quantity and UnitPrice

In [ ]:
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.boxplot(df['Quantity'])
plt.title('Boxplot of Quantity')

plt.subplot(1,2,2)
sns.boxplot(df['UnitPrice'])
plt.title('Boxplot of UnitPrice')

plt.show()
c:\Users\Swati\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
c:\Users\Swati\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

Displayed boxplots for Quantity and UnitPrice to identify the distribution and presence of outliers. The Quantity boxplot showed minimal outliers on the higher end, while the UnitPrice boxplot revealed several outliers on the upper range, indicating variability in pricing.

Customer Purchase Frequency

In [ ]:
# Minimal outliers remain in Quantity and should be handled; for UnitPrice they are not an issue.
# Ensure 'Unknown' customer IDs are excluded from this analysis
purchase_frequency = df[df['CustomerID']!='Unknown'].groupby('CustomerID')['InvoiceNo'].nunique()

plt.figure(figsize=(10,6))
sns.histplot(purchase_frequency, bins=50,kde=True)
plt.title('Customer Purchase Frequency')
plt.xlabel('Number of Purchases')
plt.ylabel('Number of Customers')
plt.show()

purchase_frequency
Out[ ]:
CustomerID
12413.0     3
12426.0     1
12427.0     3
12437.0    18
12441.0     1
           ..
18280.0     1
18281.0     1
18282.0     2
18283.0    16
18287.0     2
Name: InvoiceNo, Length: 3929, dtype: int64

Displayed a histogram showing the customer purchase frequency. The chart revealed that the majority of customers made fewer than 10 purchases, with the frequency declining sharply as the number of purchases increased, indicating a highly skewed distribution.
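The per-customer invoice count behind this histogram works like this on toy data:

```python
import pandas as pd

# Toy data: customer 1 appears on two distinct invoices, customer 2 on one.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo":  [10, 10, 11, 12, 12],
})
freq = toy.groupby("CustomerID")["InvoiceNo"].nunique()
print(freq.to_dict())  # {1: 2, 2: 1}
```

Counting distinct InvoiceNo values per customer, rather than rows, ensures a multi-item basket counts as a single purchase.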

In [ ]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month

monthly_sales = df.groupby(['Year', 'Month', 'CustomerID'])['Total_Sales'].sum().reset_index()
total_monthly_sales = monthly_sales.groupby(['Year', 'Month'])['Total_Sales'].transform('sum')
monthly_sales['SalesPercentage'] = (monthly_sales['Total_Sales'] / total_monthly_sales) * 100

highest_sales_customers = monthly_sales.loc[monthly_sales.groupby(['Year', 'Month'])['SalesPercentage'].idxmax()]

fig, ax1 = plt.subplots(figsize=(12, 6))

bars = ax1.bar(highest_sales_customers['CustomerID'].astype(str) + ' (' + highest_sales_customers['Month'].astype(str) + '/' + highest_sales_customers['Year'].astype(str) + ')',
                highest_sales_customers['SalesPercentage'], color='skyblue', alpha=0.7)

ax1.set_title('Highest Sales Customer Percentage and Total Sales Over Time')
ax1.set_xlabel('Customer ID (Month/Year)')
ax1.set_ylabel('Sales Percentage (%)', color='skyblue')
ax1.tick_params(axis='y', labelcolor='skyblue')

total_sales = df.groupby(['Year', 'Month'])['Total_Sales'].sum()
ax2 = ax1.twinx()
total_sales.plot(kind='line', marker='o', color='b', label='Total Sales', ax=ax2)

ax2.set_ylabel('Total Sales (£)', color='b')
ax2.tick_params(axis='y', labelcolor='b')

for bar in bars:
    yval = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width() / 2, yval, f'{yval:.2f}%', ha='center', va='bottom', color='black')

ax1.set_xticklabels(highest_sales_customers['CustomerID'].astype(str) + ' (' + highest_sales_customers['Month'].astype(str) + '/' + highest_sales_customers['Year'].astype(str) + ')', rotation=45)
ax1.legend(['Highest Sales Customer Percentage'], loc='upper left')
ax2.legend(['Total Sales'], loc='upper right')

plt.tight_layout()
plt.show()

Displayed a combination chart showing the highest sales customer percentage and total sales over time. The bars highlighted the contribution of top customers to sales each month, with percentages peaking in November 2011 (3.02%). The line chart indicated a sharp rise in total sales leading up to November, followed by a steep decline in December 2011.

Top 10 Products by Quantity sold in November and December 2011

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

nov_sales = df[(df['InvoiceDate'].dt.year == 2011) & (df['InvoiceDate'].dt.month == 11)]
dec_sales = df[(df['InvoiceDate'].dt.year == 2011) & (df['InvoiceDate'].dt.month == 12)]

nov_product_sales = nov_sales.groupby('Description')['Quantity'].sum().reset_index().sort_values(by='Quantity', ascending=False).head(10)
dec_product_sales = dec_sales.groupby('Description')['Quantity'].sum().reset_index().sort_values(by='Quantity', ascending=False).head(10)

top_products = pd.merge(nov_product_sales, dec_product_sales, on='Description', how='outer', suffixes=('_Nov', '_Dec')).fillna(0)

plt.figure(figsize=(12, 6))
width = 0.4

plt.bar(top_products['Description'].astype(str), top_products['Quantity_Nov'], width=width, label='November', align='center', color='lightblue')
plt.bar(top_products['Description'].astype(str), top_products['Quantity_Dec'], width=width, label='December', align='edge', color='salmon')

plt.title('Top 10 Products by Quantity Sold in November and December 2011')
plt.xlabel('Product Description')
plt.ylabel('Total Quantity Sold')
plt.xticks(rotation=45)
plt.legend()

plt.tight_layout()
plt.show()

Displayed a bar chart comparing the top 10 products by quantity sold in November and December 2011. The chart highlighted a significant drop in sales for most products in December compared to November, with the Rabbit Night Light being the most sold product in November but showing a sharp decline in December.

EDA Summary and Insights¶

  1. There are no suspicious outliers and no null values in the dataset.

  2. Only one month's data (December) is available for 2010, so care is needed in any year-level analysis.

  3. 85.65% of sales come from the UK.

  4. The highest sales come from Customer 14911 with 28.39%, followed by Customer 14096 with 12.87%.

  5. The top two products by sales value are PARTY BUNTING and JUMBO BAG RED RETROSPOT, at 16% and 12% respectively.

  6. The top two products by quantity sold are JUMBO BAG RED RETROSPOT and ASSORTED COLOUR BIRD ORNAMENT, at 15% and 13% respectively.

  7. Customer 14911 had the highest sales in 8 of 13 months, and Customer 14096 in the last three, yet the top monthly sales share is just under 3%, reflecting a diverse customer base.

  8. Sales growth fluctuated from 12/2010 to 8/2011, rose sharply thereafter, then dipped significantly in 12/2011, from 14.57% to 4% of sales.

  9. Sales of almost all products fell in December 2011, with the largest cuts from Customers 14911 and 14096.

Hypothesis Testing¶

Test 1: Correlation Analysis Between Unit Prices of Frequently Bought Items¶

Objective:
Determine if there is a positive correlation between the unit prices of items frequently bought together, uncovering trends like whether high-priced items are purchased together.

Hypotheses:

  • Null Hypothesis (H₀): No correlation between unit prices of frequently bought items (Pearson correlation coefficient = 0).
  • Alternative Hypothesis (Ha): Positive correlation exists (Pearson correlation coefficient > 0).
In [ ]:
from itertools import combinations
from collections import defaultdict
import numpy as np
from scipy.stats import pearsonr
invoice_groups = df.groupby('InvoiceNo')['StockCode'].apply(list)

item_pairs = []
for items in invoice_groups:
    if len(items) > 1:
        item_pairs.extend(combinations(items, 2))

item_prices = df.set_index('StockCode')['UnitPrice'].to_dict()

pair_prices = defaultdict(list)
for item1, item2 in item_pairs:
    if item1 in item_prices and item2 in item_prices:
        pair_prices['Item1_Price'].append(item_prices[item1])
        pair_prices['Item2_Price'].append(item_prices[item2])

item1_prices = np.array(pair_prices['Item1_Price'])
item2_prices = np.array(pair_prices['Item2_Price'])

corr_coefficient, p_value = pearsonr(item1_prices, item2_prices)
print(f"Pearson Correlation Coefficient: {corr_coefficient}")

if p_value < 0.05:
    print("Statistically significant positive correlation between prices of items bought together.")
else:
    print("No significant correlation between prices of items bought together.")

plt.figure(figsize=(8, 6))
plt.scatter(item1_prices, item2_prices, alpha=0.5, color='blue')
plt.title('Scatter Plot of Paired Item Prices')
plt.xlabel('Item 1 Price')
plt.ylabel('Item 2 Price')
plt.grid(True)
plt.show()

corr_data = pd.DataFrame({'Item1_Price': item1_prices, 'Item2_Price': item2_prices})
plt.figure(figsize=(6, 4))
sns.heatmap(corr_data.corr(), annot=True, cmap='Blues')
plt.title('Heatmap of Paired Item Price Correlation')
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(data=[item1_prices, item2_prices], palette='Set2')
plt.xticks([0, 1], ['Item 1 Prices', 'Item 2 Prices'])
plt.title('Box Plot of Item Prices')
plt.show()
Pearson Correlation Coefficient: 0.1157039282436545
Statistically significant positive correlation between prices of items bought together.

Conclusion

The correlation analysis shows a weak positive correlation (r ≈ 0.12) between the prices of items bought together, suggesting that higher-priced items tend to be purchased alongside other higher-priced items; customer buying behaviour merits further study.

Since the p-value is below 0.05, we reject the null hypothesis.
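
For reference, the Pearson coefficient that scipy reports can be reproduced by hand. This is a minimal sketch on toy price pairs (hypothetical numbers, not the project data):

```python
from math import sqrt

def pearson_r(x, y):
    # Covariance of x and y divided by the product of their standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den

# Perfectly proportional prices give r = 1.0
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # -> 1.0
```

A coefficient near 0.12, as observed above, sits far from either extreme, which is why the correlation is called weak even though it is statistically significant.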

Test 2: Comparison of Mean Quantities Purchased Across Different Countries (ANOVA Test)¶

Objective:
Compare the average quantities purchased among countries (e.g., France, Germany, EIRE, UK) to identify significant differences in purchasing behavior.

Hypotheses:

  • Null Hypothesis (H₀): No significant differences in mean quantities purchased across countries (μFrance = μGermany = μEIRE = μUK).
  • Alternate Hypothesis (H₁): At least one country pair has a significant difference in mean quantities purchased.
In [ ]:
from scipy.stats import f_oneway

data = df[['Country', 'Quantity']]

grouped_data = data.groupby('Country')['Quantity'].apply(list)
f_stat, p_value = f_oneway(*grouped_data)
print(f"ANOVA results: F-statistic = {f_stat}, p-value = {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There are significant differences in quantities purchased among countries.")
else:
    print("Fail to reject the null hypothesis: No significant differences in quantities purchased among countries.")

plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Country', y='Quantity', palette='coolwarm')
plt.title('Boxplot of Quantities Purchased by Country')
plt.xlabel('Country')
plt.ylabel('Quantity')
plt.xticks(rotation=45)
plt.show()
ANOVA results: F-statistic = 2536.468323116702, p-value = 0.0
Reject the null hypothesis: There are significant differences in quantities purchased among countries.

Tukey's HSD Test to provide specific pairwise comparison results, showing which countries have significant differences in purchasing quantities.

In [ ]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_results = pairwise_tukeyhsd(data['Quantity'], data['Country'], alpha=0.05)
print(tukey_results)

tukey_results.plot_simultaneous()
plt.title("Tukey's HSD Results")
plt.show()
    Multiple Comparison of Means - Tukey HSD, FWER=0.05     
============================================================
 group1     group2     meandiff p-adj  lower   upper  reject
------------------------------------------------------------
   EIRE         France   0.4272 0.001   0.196  0.6584   True
   EIRE        Germany   0.7777 0.001  0.5508  1.0045   True
   EIRE United Kingdom  -2.7087 0.001 -2.8829 -2.5344   True
 France        Germany   0.3504 0.001  0.1378  0.5631   True
 France United Kingdom  -3.1359 0.001 -3.2911 -2.9806   True
Germany United Kingdom  -3.4863 0.001 -3.6351 -3.3376   True
------------------------------------------------------------

Conclusion

The analysis reveals distinct purchasing behaviors among the countries studied. Reading the Tukey mean differences above (group2 minus group1), Germany has the highest mean quantity per transaction line, followed by France and EIRE, with the United Kingdom lowest, and every pairwise difference is flagged as significant. We therefore reject the null hypothesis that there are no significant differences in quantities purchased among countries.
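
For intuition, the F-statistic that `f_oneway` computes is the between-group mean square divided by the within-group mean square. A minimal sketch on toy quantities (hypothetical numbers, not the dataset):

```python
def one_way_f(groups):
    # F = MS_between / MS_within for a list of samples
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two similar groups and one clearly shifted group -> large F
print(one_way_f([[1, 2, 3], [2, 3, 4], [7, 8, 9]]))  # ~ 31.0
```

A large F, as in the test above, means the spread between country means dwarfs the spread within each country.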

Test 3: Country Association with Bulk Purchases (Chi-Squared Test)¶

Objective:
Determine if customers from different countries are more likely to make bulk purchases (quantity > 15, the threshold used in the code below) compared to others.

Hypotheses:

  • Null Hypothesis (H₀): Bulk purchasing patterns are similar across all countries.
  • Alternate Hypothesis (H₁): Customers from specific countries are more likely to make bulk purchases.
In [ ]:
import pandas as pd
import scipy.stats as stats


bulk_threshold = 15

df['BulkPurchase'] = df['Quantity'].apply(lambda x: 'Bulk' if x > bulk_threshold else 'Non-Bulk')

contingency_table = pd.crosstab(df['Country'], df['BulkPurchase'])

chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print("Contingency Table:")
print(contingency_table.sort_values(by = 'Bulk',ascending=False))
print("\nChi-squared Statistic:", chi2)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies Table:")
#print(expected)

if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant association between Country and Bulk Purchase behavior.")
else:
    print("Fail to reject the null hypothesis. No significant association between Country and Bulk Purchase behavior.")
Contingency Table:
BulkPurchase    Bulk  Non-Bulk
Country                       
United Kingdom  7591    256845
Germany          402      5910
France           328      5457
EIRE             216      4355

Chi-squared Statistic: 443.2339468410355
P-value: 9.5324481819625e-96
Degrees of Freedom: 3
Expected Frequencies Table:
Reject the null hypothesis. There is a significant association between Country and Bulk Purchase behavior.
In [ ]:
bulk_threshold = 15
df['BulkPurchase'] = df['Quantity'].apply(lambda x: 'Bulk' if x > bulk_threshold else 'Non-Bulk')

bulk_purchase_by_country = df.groupby(['Country', 'BulkPurchase'])['InvoiceNo'].count().unstack()

# Pass figsize directly; a separate plt.figure() would be left empty
bulk_purchase_by_country.plot(kind='bar', stacked=True, figsize=(15, 8))
plt.title('Bulk Purchase vs Country')
plt.xlabel('Country')
plt.ylabel('Number of Purchases')
plt.xticks(rotation=90)
plt.legend(title='Purchase Type')  # use column labels; columns are ordered Bulk, Non-Bulk
plt.tight_layout()
plt.show()

Conclusion

The analysis indicates a significant association between country and bulk purchasing: customers from the United Kingdom made by far the most bulk purchases, as shown by the chi-squared test and the contingency table of observed frequencies.

Since the p-value is far below 0.05, we reject the null hypothesis and conclude that there is a significant association between Country and bulk purchase behaviour.

Test 4: Mean Unit Prices Across Countries (ANOVA Test)¶

Objective:
Determine if there are significant differences in the average unit prices of items purchased across the United Kingdom, Germany, France, and EIRE.

Hypotheses:

  • Null Hypothesis (H₀): No significant differences in average unit prices between countries (μFrance = μGermany = μEIRE = μUK).
  • Alternate Hypothesis (H₁): At least one country pair has a significant difference in average unit prices (μFrance ≠ μGermany ≠ μEIRE ≠ μUK).
In [ ]:
from scipy.stats import f_oneway

df_cleaned = df.dropna(subset=['UnitPrice', 'Country'])

# Group the data by Country and extract UnitPrice for each country
uk_prices = df_cleaned[df_cleaned['Country'] == 'United Kingdom']['UnitPrice']
germany_prices = df_cleaned[df_cleaned['Country'] == 'Germany']['UnitPrice']
france_prices = df_cleaned[df_cleaned['Country'] == 'France']['UnitPrice']
eire_prices = df_cleaned[df_cleaned['Country'] == 'EIRE']['UnitPrice']

# Perform the ANOVA test
anova_result = f_oneway(uk_prices, germany_prices, france_prices, eire_prices)
print("ANOVA test result: F-statistic =", anova_result.statistic, ", p-value =", anova_result.pvalue)

alpha = 0.05

if anova_result.pvalue < alpha:
    print("Reject the null hypothesis. There is a significant difference in unit prices between countries.")
else:
    print("Fail to reject the null hypothesis. No significant difference in unit prices between countries.")
ANOVA test result: F-statistic = 98.47390944117025 , p-value = 1.0521299834896724e-63
Reject the null hypothesis. There is a significant difference in unit prices between countries.
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(14, 7))
sns.set_style("whitegrid")
sns.boxplot(data=df, x='Country', y='Quantity', palette='viridis')
plt.title('Quantities Purchased by Country', fontsize=16)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Quantity', fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.tight_layout()
plt.show()

Given that the p-value is significantly less than the conventional significance level of 0.05, we reject the null hypothesis. This indicates that there is strong statistical evidence to conclude that the average unit price of items purchased varies significantly between at least two of the countries in the dataset.

As a result, this finding suggests that customers in different countries exhibit different levels of price sensitivity.

Test 5: Time of purchase association with Bulk purchase (chi-squared test)¶

Objective:
Determine if there is a significant association between the time of purchase (weekly patterns) and bulk purchasing behavior, defined here as invoices containing more than 100 distinct products.

Hypotheses:

  • Null Hypothesis (H₀): No significant association between time (week of the year) and bulk purchasing behavior.
  • Alternate Hypothesis (H₁): A significant association exists between time (week of the year) and bulk purchasing behavior.
In [ ]:
def parse_date(date_str):
    for fmt in ("%d-%m-%Y %H:%M", "%m/%d/%Y %H:%M"):
        try:
            return pd.to_datetime(date_str, format=fmt)
        except ValueError:
            continue
    raise ValueError(f"Date format not recognized for {date_str}")


df['InvoiceDate'] = df['InvoiceDate'].apply(parse_date)
In [ ]:
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt


df['Week'] = df['InvoiceDate'].dt.isocalendar().week.astype(str) + '-' + df['InvoiceDate'].dt.year.astype(str)


invoice_bulk_counts = df.groupby('InvoiceNo')['StockCode'].nunique()
df['BulkPurchase'] = df['InvoiceNo'].apply(lambda x: 'Bulk' if invoice_bulk_counts[x] > 100 else 'Non-Bulk')
contingency_table = pd.crosstab(df['Week'], df['BulkPurchase'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print("Contingency Table:")
print(contingency_table)
print("\nChi-squared Statistic:", chi2)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)
print("\nExpected Frequencies Table:")
print(expected)

if p_value < 0.05:
    print("\nReject the null hypothesis. There is a significant association between time and Bulk Purchase behavior.")
else:
    print("\nFail to reject the null hypothesis. No significant association between time and Bulk Purchase behavior.")
In [ ]:
plt.figure(figsize=(17, 10))
contingency_table.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Bulk vs Non-Bulk Purchases by Week', fontsize=20)
plt.xlabel('Week', fontsize=15)
plt.ylabel('Number of Purchases', fontsize=15)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.legend(title='Purchase Type', fontsize=12)
plt.tight_layout()
plt.show()

Conclusion: Given that the p-value is less than the typical alpha level of 0.05, we reject the null hypothesis. This suggests a significant association between the time of purchase (week of the year) and the likelihood of a bulk purchase, defined here as an invoice containing more than 100 distinct products. In other words, purchasing patterns vary significantly across weeks, with some weeks showing a markedly higher tendency toward bulk purchases.
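
For reference, the chi-squared statistic used in Tests 3 and 5 compares observed counts against the counts expected under independence. A minimal sketch on a toy 2x2 table (hypothetical counts, not the project data):

```python
def chi_squared(table):
    # Chi-squared statistic for a contingency table given as a list of rows
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy Bulk/Non-Bulk counts for two groups; every expected cell is 15
print(chi_squared([[10, 20], [20, 10]]))  # ~ 6.67
```

The statistic grows with the gap between observed and expected cells, which is why the huge counts in the UK row drive the large chi-squared values reported above.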

Machine Learning Models Analysis¶

Objective

The purpose of this analysis is to utilize machine learning techniques to identify patterns in purchasing behavior and generate actionable recommendations for upselling and cross-selling. These insights aim to improve customer engagement and increase revenue across different countries and product categories.


Apriori Algorithm (Upsell)¶

Code Implementation¶

# Installing necessary libraries
!pip install psycopg2
!pip install psycopg2-binary

# Database Connection and Data Retrieval
import pandas as pd
import psycopg2
from mlxtend.frequent_patterns import apriori, association_rules

# Define RDS connection parameters
host = "projects-database.cxgcu68ksihx.us-east-1.rds.amazonaws.com"
port = "5432"
dbname = "postgres"
user = "postgres"
password = "DATA602_project"

# Establish connection and fetch data
try:
    conn = psycopg2.connect(
        host=host, port=port, database=dbname, user=user, password=password
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM sales_data;")
    data = cur.fetchall()
    columns = [desc[0] for desc in cur.description]
    ret = pd.DataFrame(data, columns=columns)

except Exception as err:
    print(f"Database connection error: {err}")
finally:
    if conn:
        cur.close()
        conn.close()

# Helper to convert frozensets to strings
def frozenset_to_string(fset):
    return ', '.join(map(str, fset)) if isinstance(fset, frozenset) else str(fset)

# Basket Preparation and Apriori Rules
def prepare_basket(data):
    return (
        data.groupby(['invoiceno', 'description'])['quantity'].sum()
        .unstack()
        .fillna(0)
        .applymap(lambda x: 1 if x > 0 else 0)
    )

def generate_apriori_rules(basket, min_support=0.03, min_confidence=0.01):
    frequent_itemsets = apriori(basket, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
    return rules

# Upsell Filtering
def filter_upsell_rules(rules, data):
    valid_upsells = []
    for _, rule in rules.iterrows():
        antecedent_items = list(rule['antecedents'])
        consequent_items = list(rule['consequents'])

        # Fetch prices for comparison
        antecedent_prices = data[data['description'].isin(antecedent_items)]['unitprice'].tolist()
        consequent_prices = data[data['description'].isin(consequent_items)]['unitprice'].tolist()

        if antecedent_prices and consequent_prices and max(antecedent_prices) < min(consequent_prices):
            valid_upsells.append(rule)

    return pd.DataFrame(valid_upsells)

# Per-Country Processing and Final Output
upsell_results = []

for country in ret['country'].unique():
    country_data = ret[ret['country'] == country]

    if not country_data.empty:
        # Prepare transactional data
        basket = prepare_basket(country_data)

        # Generate rules using Apriori
        rules = generate_apriori_rules(basket, min_support=0.03, min_confidence=0.01)

        if not rules.empty:
            # Filter rules for upsell
            filtered_upsell = filter_upsell_rules(rules, country_data)

            if not filtered_upsell.empty:
                filtered_upsell['country'] = country
                upsell_results.append(filtered_upsell)

if upsell_results:
    final_upsell_df = pd.concat(upsell_results, ignore_index=True)

    # Convert frozensets to strings for readability
    final_upsell_df['antecedents'] = final_upsell_df['antecedents'].apply(frozenset_to_string)
    final_upsell_df['consequents'] = final_upsell_df['consequents'].apply(frozenset_to_string)
    final_upsell_df_sorted = final_upsell_df.sort_values(by=['lift', 'confidence'], ascending=[False, False])


    print(final_upsell_df_sorted[['country', 'antecedents', 'support','consequents', 'confidence', 'lift']])

Results¶

| Country | Antecedents | Support | Consequents | Confidence | Lift |
|---------|-------------|---------|-------------|------------|------|
| France | PACK OF 6 SKULL PAPER CUPS | 0.042 | PACK OF 6 SKULL PAPER PLATES | 0.714 | 14.29 |
| France | LUNCH BAG APPLE DESIGN | 0.042 | LUNCH BOX WITH CUTLERY RETROSPOT | 0.312 | 2.08 |
| EIRE | COOK WITH WINE METAL SIGN | 0.054 | PLEASE ONE PERSON METAL SIGN | 0.538 | 12.49 |

Summary¶

Established a connection to an RDS PostgreSQL database and retrieved the sales data into a pandas DataFrame. The transactions were pivoted into a basket format, and the Apriori algorithm generated frequent itemsets and association rules. A custom filter kept only upsell rules in which every consequent item was more expensive than every antecedent item. Finally, upsell recommendations were produced per country and ranked by lift and confidence.

Insights¶

  • High Lift Values: The 14.29 lift between “PACK OF 6 SKULL PAPER CUPS” and “PACK OF 6 SKULL PAPER PLATES” in France highlights strong upsell potential. Products with high lift values represent significant opportunities for upselling to existing customers.
  • Cross-Category Upselling: For example, the transition from “LUNCH BAG APPLE DESIGN” to “LUNCH BOX WITH CUTLERY RETROSPOT” indicates opportunities to promote complementary items in different categories.
  • Country-Specific Marketing: Upsell patterns, such as in EIRE where “COOK WITH WINE METAL SIGN” leads to “PLEASE ONE PERSON METAL SIGN,” reflect regional preferences, enabling targeted marketing campaigns.
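
The support, confidence, and lift values in these tables follow directly from itemset supports. A minimal sketch on toy transactions (hypothetical items, not the mined rules):

```python
# Each transaction is the set of items on one invoice
transactions = [
    {"cups", "plates"}, {"cups", "plates"}, {"cups"}, {"plates"}, {"bag"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / n

# Rule: cups -> plates
supp_both = support({"cups", "plates"})                        # 2/5
confidence = supp_both / support({"cups"})                     # 2/3
lift = supp_both / (support({"cups"}) * support({"plates"}))   # (2/5) / (3/5 * 3/5)

print(supp_both, confidence, lift)
```

A lift above 1, as here, means the pair co-occurs more often than independent purchasing would predict; the lift of 14.29 for the skull-party items is an extreme version of the same effect.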

Apriori Algorithm (Cross-Sell)¶

Code Implementation¶

# Install required libraries
!pip install mlxtend psycopg2 sqlalchemy

# Import necessary libraries
import pandas as pd
from sqlalchemy import create_engine
import psycopg2
from mlxtend.frequent_patterns import apriori, association_rules

# Database connection configuration
db_config = {
    "host": "projects-database.cxgcu68ksihx.us-east-1.rds.amazonaws.com",
    "port": "5432",
    "dbname": "postgres",
    "user": "postgres",
    "password": "DATA602_project"
}

# Retrieve data from PostgreSQL database
try:
    with psycopg2.connect(**db_config) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM sales_data;")
            results = cursor.fetchall()
            columns = [desc[0] for desc in cursor.description]
            sales_data_df = pd.DataFrame(results, columns=columns)
    print("Data retrieved successfully.")
except Exception as error:
    print(f"Error connecting to PostgreSQL: {error}")

# Prepare basket data for Apriori analysis
def prepare_basket(data):
    return (
        data.groupby(['invoiceno', 'description'])['quantity'].sum()
        .unstack()
        .fillna(0)
        .applymap(lambda x: 1 if x > 0 else 0)
    )

# Generate Apriori rules
def generate_apriori_rules(basket, min_support=0.03, min_confidence=0.01):
    frequent_itemsets = apriori(basket, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
    return rules

# Iterate through each country to identify cross-sell opportunities
all_cross_sell_results = []
for country in sales_data_df['country'].unique():
    country_data = sales_data_df[sales_data_df['country'] == country]
    basket = prepare_basket(country_data)
    rules = generate_apriori_rules(basket, min_support=0.03, min_confidence=0.01)
    if not rules.empty:
        rules['country'] = country
        all_cross_sell_results.append(rules)

# Combine results
final_cross_sell_results = pd.concat(all_cross_sell_results, ignore_index=True)

# Save results back to the database
try:
    with psycopg2.connect(**db_config) as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                CREATE TABLE IF NOT EXISTS Apriori_Cross_Sell (
                    country VARCHAR(255),
                    antecedents TEXT,
                    consequents TEXT,
                    support NUMERIC,
                    confidence NUMERIC,
                    lift NUMERIC
                );
            """)
            data_to_insert = [
                (
                    row['country'],
                    ', '.join(row['antecedents']),
                    ', '.join(row['consequents']),
                    row['support'],
                    row['confidence'],
                    row['lift']
                )
                for _, row in final_cross_sell_results.iterrows()
            ]
            cursor.executemany(
                "INSERT INTO Apriori_Cross_Sell (country, antecedents, consequents, support, confidence, lift) VALUES (%s, %s, %s, %s, %s, %s);",
                data_to_insert
            )
            connection.commit()

print(final_cross_sell_results[['country', 'antecedents', 'support','consequents', 'confidence', 'lift']])

Summary¶

Implemented a Retail Recommendation System using Apriori and association rules to identify cross-sell opportunities. It established a connection to a PostgreSQL database to retrieve sales data and preprocess it into a basket format suitable for frequent pattern mining. The Apriori algorithm was applied to generate frequent itemsets and association rules for each country in the dataset.

Rules with high lift and confidence were treated as cross-sell opportunities, and the results were written back to an Apriori_Cross_Sell table in the database, with the antecedent and consequent itemsets converted to readable comma-separated strings, for persistent storage and future use.
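
The `prepare_basket` pivot can be illustrated without the database. A minimal pure-Python sketch of the same one-hot encoding (toy rows with hypothetical invoice numbers and items):

```python
# (invoiceno, description, quantity) rows, as returned by the sales_data query
rows = [
    ("536365", "WHITE HANGING HEART", 6),
    ("536365", "RED WOOLLY HOTTIE", 6),
    ("536366", "WHITE HANGING HEART", 12),
]

# Sum quantities per (invoice, item), mirroring the groupby().sum() step
basket = {}
for invoice, item, qty in rows:
    basket.setdefault(invoice, {})
    basket[invoice][item] = basket[invoice].get(item, 0) + qty

# Binarize: 1 if the item appears on the invoice at all, as applymap does above
onehot = {inv: {item: 1 if q > 0 else 0 for item, q in items.items()}
          for inv, items in basket.items()}
print(onehot)
```

Each invoice becomes a row of 0/1 flags over the product vocabulary, which is exactly the shape `apriori` expects.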

Results¶

| Country | Antecedents | Support | Consequents | Confidence | Lift |
|---------|-------------|---------|-------------|------------|------|
| France | SPACEBOY CHILDRENS CUP | 0.031 | DOLLY GIRL CHILDRENS BOWL | 0.846 | 23.43 |
| United Kingdom | PINK REGENCY TEACUP AND SAUCER | 0.024 | GREEN REGENCY TEACUP AND SAUCER | 0.811 | 22.10 |
| Germany | ROUND SNACK BOXES SET OF 4 FRUITS | 0.123 | ROUND SNACK BOXES SET OF4 WOODLAND | 0.788 | 3.23 |

Insights¶

  • Strong Complementary Products: The lift of 23.43 for “SPACEBOY CHILDRENS CUP” and “DOLLY GIRL CHILDRENS BOWL” in France underscores significant cross-sell potential.
  • Regional Preferences: United Kingdom shows a high confidence of 81.1% in pairing “PINK REGENCY TEACUP AND SAUCER” with “GREEN REGENCY TEACUP AND SAUCER,” highlighting opportunities for bundled promotions targeting tea enthusiasts.
  • Popular Bundles in Germany: The 12.3% support for “ROUND SNACK BOXES SET OF 4 FRUITS” with “ROUND SNACK BOXES SET OF4 WOODLAND” points to well-established product pairings.

Eclat Algorithm (Upsell)¶

Code Implementation¶

import pandas as pd

# Load the dataset
file_path = "../dataset/DATA602ProjectCleanedNew.csv"
df = pd.read_csv(file_path)

# Filter necessary columns
df = df[['InvoiceNo', 'Description', 'Country']]

# Group transactions by InvoiceNo
transactions = df.groupby('InvoiceNo')['Description'].apply(list).values.tolist()

# Convert transactions into vertical format
def transactions_to_vertical_format(transactions):
    vertical_format = {}
    for index, transaction in enumerate(transactions):
        for item in transaction:
            item = str(item)  # Ensure item is a string
            if item not in vertical_format:
                vertical_format[item] = set()
            vertical_format[item].add(index)
    return vertical_format

# Function to perform Eclat algorithm and find frequent itemsets
def eclat(vertical_format, min_support_count, num_transactions):
    frequent_itemsets = []

    # Calculate support for single items
    for item, transactions in vertical_format.items():
        support_count = len(transactions)
        if support_count >= min_support_count:
            support = support_count / num_transactions
            frequent_itemsets.append(((item,), support))
    return frequent_itemsets

# Convert to vertical format
vertical_format = transactions_to_vertical_format(transactions)

# Calculate the total number of transactions
num_transactions = len(transactions)

# Set minimum support to 2% of transactions
min_support = 0.02
min_support_count = int(min_support * num_transactions)

# Find frequent itemsets
frequent_itemsets = eclat(vertical_format, min_support_count, num_transactions)

# frequent_itemsets is a list of ((item,), support) tuples, not a DataFrame
frequent_df = pd.DataFrame(
    [(itemset[0], support) for itemset, support in frequent_itemsets],
    columns=['Item', 'Support']
)
print(frequent_df.sort_values('Support', ascending=False).head(10))

Summary¶

Implemented the Eclat algorithm on the retail dataset. Transactions were grouped by invoice number and converted into a vertical format, mapping each product to the set of transaction indices in which it appears. With a minimum support threshold of 2%, this pass computed support values for individual products, the starting point for mining larger co-purchased itemsets.
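
The `eclat` function above stops at 1-itemsets; in Eclat, pair supports come from intersecting the tid-sets of the vertical format. A minimal sketch (toy tid-sets with hypothetical items, not the project data):

```python
# Vertical format: item -> set of transaction indices containing it
vertical = {
    "teacup_green": {0, 1, 2, 4},
    "teacup_roses": {1, 2, 4},
    "snack_box":    {3},
}

# The support count of a pair is the size of the tid-set intersection
pair = ("teacup_green", "teacup_roses")
pair_count = len(vertical[pair[0]] & vertical[pair[1]])
print(pair_count)  # -> 3
```

This set intersection is the whole trick: no rescan of the raw transactions is needed to score candidate pairs, which is why Eclat handles large baskets efficiently.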

Results¶

| Country | Antecedents | Support | Consequents | Confidence | Lift |
|---------|-------------|---------|-------------|------------|------|
| EIRE | GREEN REGENCY TEACUP AND SAUCER | 0.112 | ROSES REGENCY TEACUP AND SAUCER | 0.963 | 7.45 |
| Germany | ROUND SNACK BOXES SET OF 4 FRUITS | 0.123 | ROUND SNACK BOXES SET OF4 WOODLAND | 0.788 | 3.23 |

Insights¶

  • High Confidence Levels: EIRE's 96.3% confidence for upselling “GREEN REGENCY TEACUP AND SAUCER” to “ROSES REGENCY TEACUP AND SAUCER” demonstrates strong customer likelihood for specific product upgrades.
  • Simplified Upsell Analysis: Eclat leverages vertical data structures, enabling efficient detection of patterns even in large datasets.

Eclat Algorithm (Cross-Sell)¶

Code Implementation¶

import pandas as pd

# Load the dataset
file_path = '../dataset/DATA602ProjectCleanedNew.csv'
df = pd.read_csv(file_path)

# Filter necessary columns
df = df[['InvoiceNo', 'Description', 'Country']]

# Group transactions by Country and InvoiceNo
transactions_by_country = df.groupby(['Country', 'InvoiceNo'])['Description'].apply(list).reset_index()

# Convert transactions into vertical format
def transactions_to_vertical_format(transactions):
    vertical_format = {}
    for index, transaction in enumerate(transactions):
        for item in transaction:
            if item not in vertical_format:
                vertical_format[item] = set()
            vertical_format[item].add(index)
    return vertical_format

# Eclat algorithm for finding frequent itemsets
def eclat(vertical_format, min_support_count, num_transactions):
    frequent_itemsets = []
    item_support = {}

    # Calculate support for single items
    for item, transactions in vertical_format.items():
        support_count = len(transactions)
        if support_count >= min_support_count:
            support = support_count / num_transactions
            frequent_itemsets.append(((item,), support))
            item_support[item] = support

    # Generate pairs of items
    items = list(vertical_format.keys())
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            item_1, item_2 = items[i], items[j]
            intersection = vertical_format[item_1].intersection(vertical_format[item_2])
            support_count = len(intersection)

            if support_count >= min_support_count:
                support = support_count / num_transactions
                frequent_itemsets.append(((item_1, item_2), support))

    return frequent_itemsets, item_support

# Generate cross-sell recommendations for each country
cross_sell_recommendations_all_countries = []

for country in transactions_by_country['Country'].unique():
    country_transactions = transactions_by_country[transactions_by_country['Country'] == country]['Description'].tolist()

    # Convert to vertical format
    vertical_format = transactions_to_vertical_format(country_transactions)

    # Calculate the total number of transactions
    num_transactions = len(country_transactions)

    # Set minimum support to 2% of transactions
    min_support = 0.02
    min_support_count = min_support * num_transactions

    # Find frequent itemsets and item support
    frequent_itemsets, item_support = eclat(vertical_format, min_support_count, num_transactions)

    # Generate recommendations with confidence and lift
    for itemset, support in frequent_itemsets:
        if len(itemset) == 2:
            item_1, item_2 = itemset
            support_item_1 = item_support[item_1]
            support_item_2 = item_support[item_2]

            confidence_1_to_2 = support / support_item_1
            confidence_2_to_1 = support / support_item_2
            lift = support / (support_item_1 * support_item_2)

            cross_sell_recommendations_all_countries.append({
                "Country": country,
                "Antecedent": item_1,
                "Consequent": item_2,
                "Support": support,
                "Confidence": confidence_1_to_2,
                "Lift": lift
            })
            cross_sell_recommendations_all_countries.append({
                "Country": country,
                "Antecedent": item_2,
                "Consequent": item_1,
                "Support": support,
                "Confidence": confidence_2_to_1,
                "Lift": lift
            })

# Create a DataFrame for the recommendations and sort by support
cross_sell_df_final = pd.DataFrame(cross_sell_recommendations_all_countries)
cross_sell_df_final = cross_sell_df_final.sort_values(by='Support', ascending=False)

print(cross_sell_df_final[['Country', 'Antecedent', 'Support', 'Consequent', 'Confidence', 'Lift']])

Summary¶

This analysis identifies cross-sell opportunities in the transactional data using the Eclat algorithm. It processes a dataset of invoices, descriptions, and countries, grouping transactions by country and invoice number. Each group of transactions is converted into a vertical format to calculate item supports and frequent itemsets against a minimum support threshold of 2%. Pairwise item combinations are then analyzed to compute support, confidence, and lift, which measure the strength of association between items. The results, including cross-sell recommendations for each country, are compiled into a DataFrame and sorted by support to highlight the most promising opportunities, enabling targeted insights into product relationships for improved cross-selling strategies.
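The three rule metrics follow directly from itemset supports. A minimal worked example, using made-up support values rather than figures from the dataset:

```python
# Worked example of the rule metrics (support values are made up)
support_a = 0.10      # P(antecedent)
support_b = 0.15      # P(consequent)
support_ab = 0.08     # P(antecedent and consequent bought together)

confidence = support_ab / support_a          # P(consequent | antecedent)
lift = support_ab / (support_a * support_b)  # boost over buying B by chance

print(round(confidence, 3))  # 0.8
print(round(lift, 3))        # 5.333
```

A lift above 1 means the pair co-occurs more often than independence would predict, which is why the tables below sort and filter on it.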

Results¶

| Country | Antecedents | Support | Consequents | Confidence | Lift |
|---|---|---|---|---|---|
| France | PLASTERS IN TIN SPACEBOY | 0.081 | PLASTERS IN TIN WOODLAND ANIMALS | 0.707 | 4.72 |
| Germany | ROUND SNACK BOXES SET OF4 WOODLAND | 0.123 | ROUND SNACK BOXES SET OF 4 FRUITS | 0.505 | 3.23 |
| United Kingdom | LUNCH BAG SPACEBOY DESIGN | 0.021 | LUNCH BAG DOLLY GIRL DESIGN | 0.598 | 10.32 |

Insights¶

  • Complementary Products: The 4.72 lift between “PLASTERS IN TIN SPACEBOY” and “PLASTERS IN TIN WOODLAND ANIMALS” in France reveals an opportunity to bundle similar products.
  • Regional Trends: Germany’s consistent cross-sell patterns between “ROUND SNACK BOXES” categories can inform bundle-specific strategies.
  • Strong Pairing in the UK: A lift of 10.32 for “LUNCH BAG SPACEBOY DESIGN” and “LUNCH BAG DOLLY GIRL DESIGN” suggests cross-sell campaigns targeting families.

FP-Growth Algorithm¶

Code Implementation¶

import pandas as pd
from sqlalchemy import create_engine
import psycopg2
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Define PostgreSQL connection parameters
host = "projects-database.cxgcu68ksihx.us-east-1.rds.amazonaws.com"
port = "5432"
dbname = "postgres"
user = "postgres"
password = "DATA602_project"

# Connect to PostgreSQL and fetch data
connection = None
try:
    connection = psycopg2.connect(
        host=host,
        port=port,
        database=dbname,
        user=user,
        password=password
    )
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM sales_data;")
    results = cursor.fetchall()
    colnames = [desc[0] for desc in cursor.description]
    sales_data = pd.DataFrame(results, columns=colnames)
except Exception as e:
    print(f"Error: {e}")
finally:
    # connection is pre-initialized to None so this check is safe even if connect() failed
    if connection:
        cursor.close()
        connection.close()

# Define helper functions
def frozenset_to_string(fset):
    if isinstance(fset, frozenset):
        return ', '.join(map(str, fset))
    return str(fset)

def filter_up_sell_candidates(up_sell_df, data):
    valid_rows = []
    for _, row in up_sell_df.iterrows():
        antecedents = row['antecedents']
        consequents = row['consequents']
        antecedent_prices = data[data['description'].isin(list(antecedents))]['unitprice'].tolist()
        consequent_prices = data[data['description'].isin(list(consequents))]['unitprice'].tolist()
        # Skip rules whose items have no price data (avoids max()/min() on empty lists)
        if not antecedent_prices or not consequent_prices:
            continue
        if max(antecedent_prices) < min(consequent_prices):
            valid_rows.append(row)
    return pd.DataFrame(valid_rows)

# Cross-sell and up-sell analysis for each country
all_cross_sell_results = []
all_up_sell_results = []

for country in sales_data['country'].unique():
    country_data = sales_data[sales_data['country'] == country]
    basket = (
        country_data.groupby(['invoiceno', 'description'])['quantity']
        .sum().unstack().reset_index().fillna(0).set_index('invoiceno')
    )
    basket_sets = (basket > 0).astype(int)  # one-hot encode: 1 if the item appears on the invoice

    frequent_itemsets = fpgrowth(basket_sets, min_support=0.02, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

    # Cross-sell rules
    cross_sell_rules = rules.sort_values(by=['lift', 'confidence'], ascending=[False, False])
    cross_sell_rules['antecedents'] = cross_sell_rules['antecedents'].apply(frozenset_to_string)
    cross_sell_rules['consequents'] = cross_sell_rules['consequents'].apply(frozenset_to_string)
    cross_sell_rules['country'] = country
    all_cross_sell_results.append(cross_sell_rules)

    # Up-sell rules: restrict to rules whose antecedent and consequent share a category.
    # NOTE: the identity mapping below is a placeholder; in practice each description
    # should be mapped to the engineered 'Category' feature here.
    rules['antecedent_category'] = rules['antecedents'].apply(lambda x: x)
    rules['consequent_category'] = rules['consequents'].apply(lambda x: x)
    up_sell_candidates = rules[
        rules['antecedent_category'] == rules['consequent_category']
    ].sort_values(by=['confidence', 'lift'], ascending=[False, False])
    filtered_up_sell_candidates = filter_up_sell_candidates(up_sell_candidates, country_data)
    filtered_up_sell_candidates['country'] = country
    all_up_sell_results.append(filtered_up_sell_candidates)

# Combine results
final_cross_sell_results = pd.concat(all_cross_sell_results, ignore_index=True)
final_up_sell_results = pd.concat(all_up_sell_results, ignore_index=True)

# Display results
print("Final Cross-Sell Rules:")
print(final_cross_sell_results[['country', 'antecedents', 'consequents', 'support', 'confidence', 'lift']])

print("\nFinal Up-Sell Candidates:")
print(final_up_sell_results[['country', 'antecedents', 'consequents', 'confidence', 'lift']])

Summary¶

This script analyzes transactional data from a PostgreSQL database to identify cross-sell and up-sell opportunities using FP-Growth frequent-pattern mining and association rules. After connecting to the database and retrieving the sales data, it processes transactions for each country: it builds a basket format suitable for analysis, identifies frequent itemsets, and generates association rules ranked by metrics such as lift and confidence.

For cross-sell opportunities, the rules with high lift and confidence are selected and labeled with their respective countries. Up-sell opportunities are identified by filtering the rules where the antecedent and consequent belong to the same product category and the antecedent's unit price is lower than the consequent's unit price. Both cross-sell and up-sell results are aggregated and displayed, providing valuable insights for enhancing sales strategies.
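The price comparison behind the up-sell filter can be illustrated without the database connection. The item names and prices below are hypothetical, chosen only to show the rule:

```python
# Hypothetical unit prices standing in for the sales_data table
unit_price = {"GREEN TEACUP": 2.95, "ROSES TEACUP": 3.75}

def is_up_sell(antecedents, consequents, prices):
    """A rule qualifies as an up-sell when every antecedent item is cheaper
    than every consequent item (the same comparison the filter above makes)."""
    ante = [prices[i] for i in antecedents if i in prices]
    cons = [prices[i] for i in consequents if i in prices]
    if not ante or not cons:   # guard against items with no price data
        return False
    return max(ante) < min(cons)

print(is_up_sell(["GREEN TEACUP"], ["ROSES TEACUP"], unit_price))  # True
print(is_up_sell(["ROSES TEACUP"], ["GREEN TEACUP"], unit_price))  # False
```

The asymmetry is the point: a rule and its reverse cannot both be up-sells, which keeps the recommendation pointed toward the higher-value item.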

FP-Growth Algorithm (Up-Sell)¶

Results¶

| Country | Antecedents | Support | Consequents | Confidence | Lift |
|---|---|---|---|---|---|
| France | PACK OF 6 SKULL PAPER CUPS | 0.042 | PACK OF 6 SKULL PAPER PLATES | 0.714 | 14.29 |
| United Kingdom | GREEN REGENCY TEACUP AND SAUCER | 0.028 | ROSES REGENCY TEACUP AND SAUCER | 0.760 | 18.96 |

Insights¶

  • High Lift for Upsell: The lift of 14.29 in France between “PACK OF 6 SKULL PAPER CUPS” and “PACK OF 6 SKULL PAPER PLATES” highlights clear upsell opportunities for related products.
  • Scalability: FP-Growth's computational efficiency makes it ideal for processing large transactional datasets.

FP-Growth Algorithm (Cross-Sell)¶

Results¶

| Country | Antecedents | Support | Consequents | Confidence | Lift |
|---|---|---|---|---|---|
| United Kingdom | PINK REGENCY TEACUP AND SAUCER | 0.024 | GREEN REGENCY TEACUP AND SAUCER | 0.811 | 22.10 |
| Germany | ALARM CLOCK BAKELIKE GREEN | 0.029 | ALARM CLOCK BAKELIKE RED | 0.654 | 13.17 |

Insights¶

  • Complementary Pairs: For example, the high lift (22.10) for “PINK REGENCY TEACUP AND SAUCER” and “GREEN REGENCY TEACUP AND SAUCER” highlights ideal candidates for cross-sell promotions.
  • Efficient Cross-Sell Discovery: FP-Growth identifies these patterns efficiently, making it suitable for dynamic retail environments.

3. Comparison of Models in Context of the Project¶

| Model | Strengths | Limitations | Specific Contribution to Project |
|---|---|---|---|
| Apriori | Interpretable, thorough, ideal for small data | Computationally expensive for large datasets | Ideal for identifying upsell and cross-sell rules with detailed confidence and lift metrics. |
| Eclat | Efficient for large datasets, simple logic | Limited interpretability | Best for quick, country-specific insights where vertical data formats simplify computation. |
| FP-Growth | Highly scalable, fast for large datasets | Requires preprocessing and tuning | Excels at analyzing large transactional data, providing quick and actionable insights across countries. |

4. Conclusion¶

The combined application of Apriori, Eclat, and FP-Growth has yielded comprehensive insights into upselling and cross-selling opportunities across multiple countries.

Key findings include:

  • High-Impact Opportunities: Lift and confidence values highlight specific products with strong upselling and cross-selling potential. Examples include upselling tea sets in EIRE and cross-selling children’s lunch bags in the UK.
  • Region-Specific Insights: Customized strategies can be developed for different regions, such as bundling snack boxes in Germany or promoting tea-related products in the UK.
  • Model Utility:
    • Apriori: Best for interpretable, detailed relationship analysis.
    • Eclat: Ideal for efficient pattern discovery in regional datasets.
    • FP-Growth: Excellent for large-scale, high-volume transactional analysis.

These findings enable businesses to enhance their marketing strategies, improve customer satisfaction, and drive revenue growth through targeted upselling and cross-selling initiatives.

Visualization¶


Apriori Support Overview:¶

  • Highest Support: Germany led with the highest average support across all regions, closely followed by EIRE and France.
  • Consistent Performance: France and Germany demonstrated similar levels of support, indicating balanced association patterns.
  • United Kingdom: While trailing slightly behind, the UK still showed strong engagement with featured antecedent-consequent pairs.

Support Insights:¶

  • Average support values ranged from 0.024 to 0.027, showcasing consistent opportunities for cross-sell and upsell strategies across regions.

Actionable Insights:¶

  1. EIRE: Focus promotional efforts on highly supported combinations to capitalize on strong customer engagement.
  2. Germany and France: Maintain targeted campaigns around consistent product associations to boost sales further.
  3. United Kingdom: Reinforce awareness for cross-sell opportunities to narrow the gap in average support.


Germany Highlights:¶

  • Top pairs: Round Snack Boxes Set of 4 Fruits and Round Snack Boxes Set of 4 Woodland achieved the highest confidence, indicating strong potential for cross-sell campaigns.
  • Other strong performers: Red Retrospot Mini Cases paired with Lunch Boxes with Cutlery Retrospot and Strawberry Lunch Boxes highlighted their appeal across customer segments.

Confidence Range:¶

  • Confidence levels ranged from 0.35 to 0.50, showcasing dependable cross-sell opportunities for snack boxes and lunch accessories.

Actionable Insights:¶

  1. Snack Box Bundles: Promote Round Snack Boxes in combination with lunch accessories like Lunch Boxes with Cutlery to maximize cross-sell effectiveness.
  2. Regional Campaigns: Leverage high-confidence items, such as Red Retrospot Mini Cases, in targeted marketing strategies to boost engagement.
  3. Product Pairing Focus: Use strong pairs like Lunch Box I Love London and Woodland Snack Boxes as anchors for upsell efforts.


EIRE Highlights:¶

  • Top pairs: Regency Milk Jug Pink and Regency Tea Plate Pink consistently achieved the highest confidence, making them strong cross-sell candidates.
  • Other strong performers: Pink Regency Teacup and Saucer and Roses Regency Teacup and Saucer displayed excellent synergy, appealing to tea set enthusiasts.

Confidence Range:¶

  • Confidence scores peaked at 1.0, meaning every purchase of these antecedents also included their consequents.

Actionable Insights:¶

  1. Tea Set Promotions: Bundle Regency Milk Jugs with Tea Plates and Teacups to capture customer interest effectively.
  2. Upsell Opportunities: Focus on enhancing visibility for complementary products like Pink Regency Teacups to drive multi-item purchases.
  3. Highlight Strong Pairs: Use high-confidence combinations like Milk Jugs and Tea Plates as flagship offers in marketing campaigns.


Germany Highlights:¶

  • Top pairs: Round Snack Boxes Set of 4 Woodland paired with Plasters in Tin (Circus Parade/Woodland Animals) demonstrated high confidence, highlighting strong cross-sell potential.
  • Other strong performers: Spaceboy Lunch Box and Pack of 20 Napkins Pantry Design effectively complemented related antecedents.

Confidence Range:¶

  • Confidence values sat slightly above 0.6, indicating reliable and consistent cross-sell opportunities.

Actionable Insights:¶

  1. Snack Box Bundles: Promote Round Snack Boxes alongside Plasters in Tin to enhance customer engagement.
  2. Lunch Box Campaigns: Feature Spaceboy Lunch Box with complementary napkins or other thematic accessories.
  3. Marketing Strategies: Leverage high-confidence pairs to design targeted campaigns, emphasizing reliability and value in cross-sell combinations.

[Figure: Confidence Scores — Apriori Up-Sell]

Consequents with High Confidence

  • REGENCY SUGAR BOWL AND SAUCER and PACK OF 6 SKULL PAPER PLATES have high confidence scores, indicating strong purchase likelihood with their associated antecedents.
  • Other strong performers: PINK REGENCY TEACUP AND SAUCER and GREEN REGENCY TEACUP AND SAUCER are consistent across regions.

Metrics by Country

  • EIRE: GREEN and PINK REGENCY TEACUP AND SAUCER have the highest confidence scores, reflecting strong upsell opportunities.
  • France: PACK OF 6 SKULL PAPER PLATES and REGENCY SUGAR BOWL GREEN are reliable upsell candidates with competitive confidence scores.
  • Comparison: Confidence scores are high in both regions, with EIRE slightly outperforming France in certain categories.

Key Observations

  • Consistency: Confidence values are stable (0.7–0.8), indicating reliable upsell potential.
  • Popular Categories: Tea sets and home decor items (e.g., teacups, saucers, bowls) are customer favorites.

Actionable Insights

  • Promote Regency tea sets and complementary items in EIRE.
  • Craft upsell strategies for PACK OF 6 SKULL PAPER PLATES in France.
  • Tailor strategies to regional preferences: teacup and saucer bundles in EIRE, paper plates in France.
  • Focus on high-confidence consequents to maximize upselling effectiveness and customer satisfaction.


United Kingdom Highlights:¶

  • Top pairs: Red/Green Alarm Clock Bakelike, as well as Pink/Green Regency Teacups, showed high confidence, reflecting strong cross-sell potential.
  • Other strong items include Lunch Bag Red Polkadot/Lunch Bag Pink Polkadot and Gardeners Kneeling Pads, appealing to niche customer preferences.

Germany Highlights:¶

  • Key combinations: Blue/Pink Happy Birthday Buntings and Jumbo Bags (Red Retrospot, Woodland, Pink Polkadot) performed exceptionally well.
  • Snack Boxes and Children’s Cutlery (Circus Parade/Spaceboy) emerged as reliable cross-sell candidates.

Confidence Range:¶

  • Confidence scores ranged from 0.4769 to 0.6568, indicating consistent cross-sell opportunities across regions.

Actionable Insights:¶

  1. United Kingdom: Focus on bundling Lunch Bags and Teacups for strong results.
  2. Germany: Promote Jumbo Bags and Snack Box sets in cross-sell campaigns.
  3. Highlight high-confidence items as anchors to drive sales across both regions.


France Highlights:¶

  • Top pairs: Alarm Clock Bakelike (Red, Green, Pink, Chocolate) consistently showed high cross-sell confidence across various combinations.
  • Plasters in Tin Spaceboy paired well with the alarm clocks, indicating a strong potential for bundled promotions.

EIRE Highlights:¶

  • Key combinations: Baking Sets (Spaceboy Design, Retrospot) and Hot Water Bottles (Chocolate, English Rose) were prominent cross-sell items.
  • Additional strong performers included Feltcraft items (Cushions, Dolls) and Diner Wall Clocks (Red, Blue, Ivory).

Confidence Range:¶

  • Confidence reached a maximum of 1.0, showcasing highly reliable cross-sell pairs in both France and EIRE.

Actionable Insights:¶

  1. France: Develop bundled offers around Alarm Clocks with complementary items like Plasters in Tin Spaceboy.
  2. EIRE: Focus on promoting Baking Sets and Feltcraft products to enhance cross-sell strategies.
  3. Leverage high-confidence combinations to maximize sales and customer satisfaction in both regions.

[Figure: Support Scores — Apriori Cross-Sell]

Regional Performance:

  • EIRE: Strong affinity for Regency products like teacups and milk jugs.
  • Germany: Diverse interests with focus on functional and themed items like snack boxes and Woodland Animal Plasters.

Metrics Overview:

  • Support Range: Moderate scores (0.035–0.055) indicate significant co-purchase patterns.
  • Regional Consistency: Regency products dominate in EIRE, while Germany shows varied preferences.

Actionable Insights:

  • EIRE: Bundle REGULAR MILK JUG PINK and PINK REGENCY TEACUP AND SAUCER to leverage Regency product popularity.

  • Germany: Promote combinations like ROUND SNACK BOXES SET OF 4 WOODLAND and PLASTERS IN TIN WOODLAND ANIMALS for themed cross-selling.

  • Bundle Development: Create tailored bundles for high-support pairs in each region.

  • Marketing Strategy:

    • Focus on Regency products in EIRE.
    • Broaden product offerings in Germany to target diverse customer segments.


Germany Highlights:¶

  • Top consequents: Round Snack Boxes Set of 4 Woodland paired effectively with Charlotte Bag Apples Design and Coffee Mug Apples Design, showcasing strong customer interest in thematic combinations.
  • Other notable pairs: Pack of 20 Napkins Pantry Design and Red Retrospot Cup emerged as reliable cross-sell options.

Support Range:¶

  • Support values are slightly below 0.375, indicating consistent cross-sell opportunities for the highlighted items.

Actionable Insights:¶

  1. Focus on Themes: Market Charlotte Bag Apples Design and Coffee Mug Apples Design alongside Round Snack Boxes to attract thematic buyers.
  2. Expand Napkin Campaigns: Leverage the popularity of Pack of 20 Napkins to bundle with snack boxes for enhanced value.
  3. Promotional Strategies: Utilize high-support items as anchors in targeted marketing initiatives to drive cross-sell success.


France Highlights:¶

  • Top antecedents: Plasters in Tin Woodland Animals frequently paired with Plasters in Tin Circus Parade, showcasing a strong cross-sell relationship.
  • Other notable combinations: Plasters in Tin Spaceboy and Alarm Clock Bakelike (Red/Green) demonstrated high support, appealing to diverse customer preferences.

Support Range:¶

  • Support values reached a maximum of 0.08, reflecting significant customer interest in the highlighted item pairs.

Actionable Insights:¶

  1. Bundle Strategy: Leverage Plasters in Tin Woodland Animals as a key item to promote related themes like Circus Parade and Spaceboy.
  2. Expand Clock Campaigns: Highlight the popularity of Alarm Clock Bakelike (Red/Green) in targeted offers to boost sales.
  3. Theme-Based Promotions: Focus on thematic bundles combining plasters and clocks to capitalize on customer purchasing trends.


EIRE Highlights:¶

  • Top antecedents: English Rose Hot Water Bottle paired effectively with Chocolate Hot Water Bottle, demonstrating strong customer preference for warm-themed items.
  • Other notable combinations: Feltcraft Cushion Butterfly with Feltcraft Cushion Owl and Recipe Box Blue Sketchbook Design with Recipe Box Pantry Yellow Design showed consistent support.

Support Range:¶

  • Support values exceed 0.025, indicating moderate yet impactful associations between the highlighted product pairs.

Actionable Insights:¶

  1. Warm Product Bundles: Focus on promoting hot water bottles as a pair to enhance sales during colder months.
  2. Crafted Themes: Utilize Feltcraft Cushion pairs to target customers interested in handmade and artisanal products.
  3. Kitchen Organization: Leverage the interest in Recipe Boxes to craft campaigns appealing to customers looking for storage solutions.


United Kingdom Highlights:¶

  • Strong antecedents: Alarm Clock Bakelike Red and Alarm Clock Bakelike Green showcased reliable support when paired together.
  • Popular combinations: Jumbo Bag Pink Polkadot with Jumbo Bag Red Retrospot and Lunch Bag Pink Polkadot with Lunch Bag Red Retrospot highlighted significant customer interest in colorful and themed products.

Support Range:¶

  • Maximum support peaked at 0.030, indicating meaningful associations among featured product pairs.

Actionable Insights:¶

  1. Color Coordination: Promote matching alarm clocks and lunch bags as bundled sets to target design-conscious customers.
  2. Themed Packaging: Leverage the popularity of polkadot jumbo bags to enhance gift-ready product offerings.
  3. Highlight Strong Pairs: Use cross-sell strategies focusing on Alarm Clocks and Polkadot Bags to drive additional sales.


Eclat Support Overview:¶

  • Top Performing Region: France led with the highest average support, confirming its strong cross-sell and upsell potential.
  • Close Contenders: EIRE and Germany displayed comparable support levels, showcasing consistent product associations.
  • United Kingdom: Although slightly behind other regions, the UK demonstrated steady engagement, reflecting room for optimization.

Support Highlights:¶

  • Average support values were tightly clustered, ranging from 0.024 to 0.029, indicating reliable opportunities across all regions.

Actionable Insights:¶

  1. France: Prioritize campaigns in this region to maximize returns on high-support associations.
  2. EIRE and Germany: Leverage stable support patterns to maintain momentum and boost customer satisfaction.
  3. United Kingdom: Enhance visibility and bundling strategies to improve support metrics and align with other regions.


United Kingdom Support Analysis:¶

  • Top Consequent Items: The Alarm Clock Bakelike series (Red and Green) and Jumbo Bags (Pink Polkadot, Red Retrospot) exhibited the highest support levels, indicating frequent pairings with their antecedents.
  • Strong Pairings: Lunch Bags (Pink Polkadot, Red Retrospot) maintained a steady presence, showcasing reliable customer demand.

Support Insights:¶

  • Max Support: Values ranged around 0.030, reflecting stable cross-sell opportunities across multiple product pairs.

Actionable Recommendations:¶

  1. Alarm Clock Bakelike Series: Bundle Red and Green variants to capitalize on strong support.
  2. Jumbo Bags and Lunch Bags: Focus on themed promotions combining these items to boost sales.
  3. Use high-support antecedents to drive targeted campaigns, leveraging established customer preferences.

[Figure: Confidence Scores — Eclat Up-Sell]

EIRE's Dominance:

  • Products from EIRE, particularly Regency items, show exceptional upselling potential.
  • Key combinations include PINK REGENCY TEACUP AND SAUCER, REGENCY MILK JUG PINK, REGENCY TEA PLATE PINK, and ROSES REGENCY TEACUP AND SAUCER.
  • GREEN REGENCY TEACUP AND SAUCER is frequently paired with other Regency tea items.

Metrics Insights:

  • High Confidence (1.0): All product pairs achieve perfect co-purchasing behavior, indicating every antecedent purchase leads to the consequent purchase.
  • Product Group Trends: Regency products dominate, with frequent bundling of Milk Jugs, Tea Plates, and Teacups.

Actionable Insights

  • Targeted Marketing Campaigns: Launch Regency bundles in EIRE, supported by bundle discounts or loyalty incentives.
  • Cross-Category Upselling: Pair Regency items with complementary products, such as teas or kitchen accessories, to increase basket value.
  • Retail Display Optimization: Arrange Regency products together in physical or online stores for bundled purchase encouragement.
  • Customer Feedback Loop: Gather feedback to refine existing products or introduce new items in the Regency line.


Germany Highlights:¶

  • Top pairs: Round Snack Boxes (Woodland and Fruits variants) consistently achieved high support, indicating strong cross-sell potential.
  • Plasters in Tin Spaceboy paired effectively with both Snack Boxes and the Spaceboy Lunch Box, showcasing reliable bundling opportunities.

Support Range:¶

  • Support values ranged from 0.02844 to 0.12322, highlighting consistent cross-sell patterns in the German market.

Actionable Insights:¶

  1. Prioritize bundling strategies for Round Snack Boxes (Woodland and Fruits) in cross-sell campaigns.
  2. Promote Plasters in Tin Spaceboy alongside complementary items such as Snack Boxes and Lunch Boxes.
  3. Leverage high-support items to enhance product pairing and drive customer satisfaction.


France Highlights:¶

  • Top pairs: Lunch Bag Woodland paired with Spaceboy Design and Red Retrospot, showing robust support for cross-sell opportunities.
  • Other key combinations include Plasters in Tin Woodland Animals and Round Snack Boxes Set of 4 Fruits, enhancing bundling strategies.

United Kingdom Highlights:¶

  • Top pairs: Green and Roses Regency Teacups demonstrated strong support, making them ideal for cross-sell promotions.
  • Additional strong performers include Alarm Clock Bakelike (Red/Green) and Lunch Bag Woodland with Spaceboy Design, appealing to diverse customer needs.

Support Range:¶

  • Support values ranged from 0.02222 to 0.06667, reflecting steady cross-sell potential in both regions.

Actionable Insights:¶

  1. France: Focus on promoting Lunch Bag Woodland and its combinations to maximize cross-sell effectiveness.
  2. United Kingdom: Highlight Regency Teacups and Alarm Clocks in upselling campaigns.
  3. Leverage top-performing pairs to design tailored marketing strategies that cater to regional preferences.


EIRE Highlights:¶

  • Top pairs: Green Regency Teacup and Saucer paired with Roses Regency Teacup and Saucer emerged as the strongest cross-sell candidate with the highest support.
  • Other combinations: Ivory Kitchen Scales linked with Red Kitchen Scales and Regency Sugar Bowl Green paired with Regency Milk Jug Pink showed reliable support for upsell potential.

Support Range:¶

  • Support values spanned from 0.06 to 0.11, indicating consistent cross-sell opportunities in the region.

Actionable Insights:¶

  1. Focus on Teacups: Highlight Green and Roses Regency Teacups in bundling promotions.
  2. Upsell Strategy: Pair Kitchen Scales and Regency Sugar Bowls with complementary items to enhance customer value.
  3. Leverage high-support combinations to design region-specific marketing campaigns.

Pairs by Support Score - Eclat Cross-Sell¶

[Figure: Support Scores — Eclat Antecedent/Consequent Cross-Sell Pairs]

EIRE Dominance in Regency Products:

  • GREEN REGENCY TEACUP AND SAUCER → ROSES REGENCY TEACUP AND SAUCER has the highest support (~0.123), indicating frequent co-purchases.
  • GREEN REGENCY TEACUP AND SAUCER → PINK REGENCY TEACUP AND SAUCER also shows strong support, highlighting customer preference for Regency items.

Germany's Snack Box Bundles:

  • ROUND SNACK BOXES SET OF 4 WOODLAND → ROUND SNACK BOXES SET OF 4 FRUITS is the leading pair, reflecting high co-purchasing behavior for snack boxes.

France's Regional Preferences:

  • PLASTERS IN TIN SPACEBOY → PLASTERS IN TIN WOODLAND ANIMALS highlights France’s preference for themed products, with high support scores.

Actionable Insights

  • EIRE Marketing Strategy:

    • Bundle GREEN, PINK, and ROSES REGENCY TEACUPS AND SAUCERS.
    • Offer discounts for Regency sets to boost customer loyalty.
  • Germany Campaigns:

    • Promote snack box bundles (e.g., "Woodland" and "Fruits").
    • Introduce gifting options or multi-pack discounts during peak seasons.
  • France Targeted Promotions:

    • Bundle PLASTERS IN TIN SPACEBOY with PLASTERS IN TIN WOODLAND ANIMALS to cater to thematic preferences.
    • Expand the product line with similar themes to engage this audience further.


EIRE Confidence Analysis:¶

  • Top Confidence Pair: English Rose Hot Water Bottle paired with Chocolate Hot Water Bottle achieved the maximum confidence score of 1.0, signaling a near-guaranteed pairing.
  • Strong Candidates: The Regency Tea Plate series (Pink, Green, and Roses) paired with Green Regency Teacups consistently showed high confidence, emphasizing their strong association.
  • Additional Highlight: Biscuit Tin Vintage Red paired with Round Cake Tin Vintage Red demonstrated significant confidence levels, reinforcing its cross-sell potential.

Confidence Insights:¶

  • Range: Confidence levels ranged from 0.7 to 1.0, indicating robust and reliable product pairings.

Actionable Insights:¶

  1. Leverage Maximum Confidence Pair: Highlight the English Rose Hot Water Bottle combination in marketing campaigns.
  2. Bundle Regency Products: Create promotional bundles featuring Regency Tea Plates and Teacups to capitalize on strong customer preferences.
  3. Focus on Vintage Tins: Use the pairing of Biscuit Tin Vintage Red to drive sales in the home decor and gifting categories.


France Confidence Analysis:¶

  • Top Confidence Pair: Pack of 6 Skull Paper Cups and Pack of 6 Skull Paper Plates achieved the highest confidence score, showcasing significant cross-sell potential.
  • Notable Associations:
    • Dolly Girl Lunch Box and its related combinations (e.g., Lunch Bag Dolly Girl Design) showed consistent confidence, making it a strong candidate for bundled promotions.
    • Lunch Box with Cutlery Retrospot demonstrated reliable pairings with various lunch bag designs.

Confidence Insights:¶

  • Range: Confidence values spanned from 0.3 to 0.7, highlighting diverse yet impactful cross-sell opportunities.

Actionable Insights:¶

  1. Promote Skull Paper Products: Leverage the popularity of the Pack of 6 Skull Paper Cups and Plates for seasonal campaigns or party-themed bundles.
  2. Create Lunch Box Bundles: Capitalize on the strong confidence in Dolly Girl Lunch Box and Lunch Bag Retrospot combinations to target school and office-goers.
  3. Optimize Marketing: Highlight these pairings in digital and in-store promotions to maximize cross-sell effectiveness.

[Figure: FP-Growth upsell support scores]

Country-Level Performance:

  • EIRE and France exhibit high support for Regency products, highlighting the popularity of branded teacups and saucers.
  • REGENCY TEA PLATE PINK performs consistently in both regions, suggesting a strong pairing opportunity.

Support Distribution:

  • Uniform average support for antecedents leading to key consequents across EIRE and France, indicating balanced upselling potential.

Actionable Insights:

  • Targeted Regional Campaigns:

    • Focus on PACK OF 6 SKULL PAPER PLATES in France for promotions based on high regional support.
    • Prioritize Regency-branded products like PINK REGENCY TEACUP AND SAUCER in EIRE to align with local preferences.
  • Product Bundling Opportunities:

    • Bundle GREEN REGENCY TEACUP AND SAUCER with REGENCY TEA PLATE PINK to leverage frequent customer pairings.
  • Consistency Across Countries:

    • Develop standardized marketing campaigns around REGENCY TEA PLATE PINK to target customers in both EIRE and France effectively.


EIRE Highlights:¶

  • Top Performer: Roses Regency Teacup and Saucer, Green achieved the highest confidence score, making it the most promising product for cross-sell opportunities.

France Highlights:¶

  • Top Performer: Alarm Clock Bakelike (Green and Red) stood out as the most reliable product for cross-sell opportunities, reflecting high confidence levels across pairings.

Actionable Insights:¶

  1. EIRE: Prioritize promoting Roses Regency Teacup and Saucer, Green in upsell campaigns while bundling with other Regency tea sets for higher customer engagement.
  2. France: Focus cross-sell campaigns around Alarm Clock Bakelike variants, leveraging their strong confidence scores to maximize sales.
  3. Tailor marketing efforts to emphasize these high-performing products, ensuring region-specific preferences are met effectively.


Germany Highlights:¶

  • Top Performer: Strawberry Lunch Box with Cutlery paired with Round Snack Boxes Set of 4 Woodland emerged as the highest confidence pair, indicating strong cross-sell potential.
  • Reliable Bundles: Plasters in Tin Woodland Animals and Round Snack Boxes Set of 4 Woodland continue to perform well, showcasing their effectiveness in driving customer purchases.

United Kingdom Highlights:¶

  • Key Drivers: Green Regency Teacup and Saucer and Alarm Clock Bakelike Green and Red consistently demonstrated high confidence levels, making them standout performers in the market.
  • Emerging Pairs: Bundles involving Lunch Bag Spaceboy Design and Lunch Bag Cars Blue also showed promising cross-sell opportunities.

Actionable Insights:¶

  1. Germany: Prioritize promoting the Strawberry Lunch Box with Cutlery alongside Round Snack Boxes in cross-sell campaigns to maximize sales.
  2. United Kingdom: Develop targeted marketing campaigns around Green Regency Teacup and Saucer and Alarm Clocks to capitalize on their strong market appeal.
  3. Highlight these high-confidence pairs in promotional efforts to encourage additional purchases and improve overall sales performance.


FP Growth - Regional Support Insights¶

  • France and Germany exhibit the highest average support values, indicating robust purchase patterns for frequent itemsets in these regions.
  • EIRE closely follows, showing a solid performance in support metrics, with a consistent customer preference for specific item combinations.
  • United Kingdom records the lowest average support, signaling potential for improvement in frequent pair strategies.

Actionable Recommendations¶

  1. Capitalize on High Support in France and Germany: Focus on maintaining and enhancing combinations with proven performance in these markets.
  2. Strengthen EIRE's Performance: Leverage existing trends and explore opportunities to further boost frequently purchased combinations.
  3. Revitalize the UK Market: Develop strategies such as bundling and personalized promotions to improve support for key items and enhance customer engagement.
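
Country-level comparisons like these come from mining each region's transactions separately. A hedged sketch of the grouping step, with hypothetical row data standing in for the real invoice-level columns, showing how per-country baskets might be built before running Apriori, FP-Growth, or Eclat:

```python
from collections import defaultdict

# Hypothetical rows: (invoice_id, country, item). The real dataset
# uses columns along the lines of InvoiceNo, Country, Description.
rows = [
    ("536365", "France", "PACK OF 6 SKULL PAPER CUPS"),
    ("536365", "France", "PACK OF 6 SKULL PAPER PLATES"),
    ("536366", "EIRE", "GREEN REGENCY TEACUP AND SAUCER"),
    ("536366", "EIRE", "REGENCY TEA PLATE PINK"),
    ("536367", "France", "PACK OF 6 SKULL PAPER CUPS"),
]

# Group items into one basket per invoice, keyed by country.
baskets = defaultdict(lambda: defaultdict(set))
for invoice, country, item in rows:
    baskets[country][invoice].add(item)

# Per-country transaction lists, ready to feed any of the three miners.
per_country = {c: list(inv.values()) for c, inv in baskets.items()}
print(sorted(per_country))          # ['EIRE', 'France']
print(len(per_country["France"]))   # 2 invoices -> 2 baskets
```

Mining each country's baskets independently is what keeps the support averages above comparable across regions.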

Insights and Conclusions¶

Our project revealed distinct customer purchasing patterns that varied across countries, underscoring the need for customized upselling and cross-selling strategies. In France, for example, high-confidence, high-lift pairs such as "PACK OF 6 SKULL PAPER CUPS" → "PACK OF 6 SKULL PAPER PLATES" (lift 14.29, confidence 71.4%) show clear upsell potential. Similarly, cross-sell opportunities such as "SPACEBOY CHILDREN'S CUP" → "DOLLY GIRL CHILDREN'S BOWL" (lift 23.43, confidence 84.6%) illustrate how complementary products can reinforce sales. Regional purchasing trends further highlight the need for localized approaches: EIRE customers gravitate toward Regency-branded products, while German customers prefer practical bundled items such as snack boxes.
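
Because lift is confidence divided by the consequent's baseline support, the reported France figures let us back out how common the consequent item is on its own, a small sanity check on the rule:

```python
# Figures reported above for the France rule
# "PACK OF 6 SKULL PAPER CUPS" -> "PACK OF 6 SKULL PAPER PLATES".
confidence = 0.714
lift = 14.29

# lift = confidence / support(consequent), so the consequent's
# baseline support falls out directly:
support_consequent = confidence / lift
print(round(support_consequent, 3))  # 0.05
```

In other words, the plates appear in roughly 5% of French baskets overall, yet in 71.4% of baskets that contain the cups, which is what makes the pairing so strong.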

The comparative analysis of Apriori, FP-Growth, and Eclat showed clearly where each algorithm excels. Apriori produced transparent, interpretable results, which were essential for understanding product affinities. FP-Growth handled larger datasets efficiently while maintaining high confidence, and Eclat's speed and simplicity make it ideal for targeted, streamlined cross-selling strategies.
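
Eclat's speed and simplicity come from its vertical data layout: each item maps to the set of transaction IDs containing it (its "tidset"), so the support of any itemset is just a set intersection. A minimal illustration of that idea on toy data, not our full pipeline:

```python
from itertools import combinations

# Toy transactions: transaction ID -> items purchased.
transactions = {
    1: {"cups", "plates"},
    2: {"cups", "plates", "napkins"},
    3: {"cups"},
    4: {"plates"},
}

# Vertical layout: item -> set of transaction IDs (tidset).
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

min_support = 0.5  # require the pair in at least half of all baskets
n = len(transactions)

# Frequent pairs: support of {a, b} is |tidset(a) ∩ tidset(b)| / n.
frequent_pairs = {
    (a, b): len(tidsets[a] & tidsets[b]) / n
    for a, b in combinations(sorted(tidsets), 2)
    if len(tidsets[a] & tidsets[b]) / n >= min_support
}
print(frequent_pairs)  # {('cups', 'plates'): 0.5}
```

No repeated database scans are needed, which is the structural reason Eclat outpaced the other two miners on our cross-sell runs.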

Using the three algorithms together exposed the strengths and weaknesses of each, supporting data-informed, region-specific recommendations aimed at better customer engagement and revenue. By identifying high-lift product pairs and understanding regional purchasing habits, the analysis offers valuable insight into upsell and cross-sell potential.

Actionable Insights:

  • Upsell Potential: Products with high lift ratios, such as those identified in France, are promising opportunities to boost sales by suggesting complementary upgrades. Suggesting premium alternatives or bundled items has been shown to increase transaction value significantly.
  • Regional Strategies:

    • EIRE: Promote Regency-branded products, a known preference in this region.
    • Germany: Exploit the popularity of bundled, practical items such as snack boxes.
  • Cross-Sell Campaigns: Use high-confidence product pairings to improve cross-sell outcomes, recommending products that meet complementary needs and enhance the customer's overall purchase experience.

Future Directions:

Going forward, the next step is to integrate real-time data for dynamic recommendations, giving customers instant, contextually relevant product suggestions. Enriching the dataset with demographic insights and seasonality will improve accuracy, enabling finer-grained recommendation strategies based on time and customer profiles.

By adopting these data-driven strategies, retailers can boost customer satisfaction, streamline the purchase experience, and support long-term business growth through targeted upselling and cross-selling.

Data Science Ethics¶

The main ethical issues our project raises are bias in data collection and algorithmic fairness. The dataset could be biased, for instance through overrepresentation of certain regions or customer segments, skewing insights toward the larger volumes of data from more active areas. In addition, recommendation algorithms may prioritize certain products or customer groups, putting smaller brands or niche items at a disadvantage. To keep insights region-specific, we grouped the data by country and calculated metrics such as support, confidence, and lift independently for each region. Interpretable algorithms like Apriori were employed to ensure transparency, and outputs were carefully documented and validated for fairness and accuracy.

To ensure the project's fairness and transparency, the recommendations were verified using several metrics, and a comparative analysis across Apriori, FP-Growth, and ECLAT confirmed that the results were robust and consistent. The upselling and cross-selling methods were designed to raise customer satisfaction by recommending genuinely valuable, complementary products rather than exploiting customers. Future analyses should include a wider range of regions and customer profiles for inclusivity; this matters because as the system scales to real-time recommendations, both algorithmic scalability and fairness are at stake.

Thank You!